Abstract

In this paper, we consider the filtering and smoothing recursions in nonparametric finite state space hidden 
Markov models (HMMs) when the parameters of the model are unknown and replaced by estimators. We 
provide an explicit and time uniform control of the filtering and smoothing errors in total variation norm as 
a function of the parameter estimation errors. We prove that the risk for the filtering and smoothing errors 
may be uniformly upper bounded by the risk of the estimators. It has been proved very recently that statistical 
inference for finite state space nonparametric HMMs is possible. We study how the recent spectral methods 
developed in the parametric setting may be extended to the nonparametric framework and we give explicit 
upper bounds for the $L^2$-risk of the nonparametric spectral estimators. When the observation space is compact,
this provides explicit rates for the filtering and smoothing errors in total variation norm. The performance of 
the spectral method is assessed with simulated data for both the estimation of the (nonparametric) conditional 
distribution of the observations and the estimation of the marginal smoothing distributions. 


1 Introduction 

1.1 Context and motivations 

Hidden Markov models are popular time evolving models to depict practical situations in a variety of
applications such as economics, genomics, signal processing and image analysis, ecology, environment, speech
recognition, see [11] for a recent overview of HMMs. Finite state space HMMs are stochastic processes
$(X_j, Y_j)_{j \ge 1}$ such that $(X_j)_{j \ge 1}$ is a Markov chain with finite state space $\mathcal{X}$ and
$(Y_j)_{j \ge 1}$ are random variables with general state space $\mathcal{Y}$, independent conditionally on
$(X_j)_{j \ge 1}$ and such that for all $\ell \ge 1$, the conditional distribution of $Y_\ell$ given
$(X_j)_{j \ge 1}$ depends on $X_\ell$ only. The observations are $Y_{1:n} := (Y_1, \dots, Y_n)$ and the associated
states $X_{1:n} := (X_1, \dots, X_n)$ are unobserved. The parameters of the model are the initial distribution
$\pi^\star$ of the hidden chain, the transition matrix of the hidden chain $Q^\star$, and the conditional
distribution of $Y_1$ given $X_1 = x$ for all possible $x \in \mathcal{X}$, which are often called emission
distributions.

In many applications of finite state space HMMs (e.g. digital communication or speech recognition), it is of 
utmost importance to infer the sequence of hidden states. Such inference usually involves the computation of the 
posterior distribution of a set of hidden states $X_{k:k'}$, $1 \le k \le k' \le n$, given the observations $Y_{1:s}$, $1 \le s \le n$.
When the initial distribution of the hidden chain, its transition matrix and the conditional distribution of the 
observations are known, this task can be efficiently done using the forward-backward algorithm described in [5] 
and [25]. In this paper, we focus on the estimation of the filtering distributions $\mathbb{P}(X_k = x \mid Y_{1:k})$ and marginal
smoothing distributions $\mathbb{P}(X_k = x \mid Y_{1:n})$ for all $1 \le k \le n$ when the parameters of the HMM are unknown and
replaced by estimators. Indeed, it has been proved very recently that inference in finite state space nonparametric 
HMMs is possible, see [16]. 

1.2 Contribution 

The aim of our paper is twofold. 

- First, we study how the parameter estimation error propagates to the error made on the estimation of filtering 
and smoothing distributions. Although replacing parameters by their estimators to compute posterior
distributions and infer the hidden states is usual in applications, theoretical results to support this practice are
very few regarding the accuracy of the estimated posterior distributions. We are only aware of [15] whose
results are restricted to the filtering distribution in a parametric setting. When the parameters of the HMM 
are known, the forward-backward algorithm can be extended to general state space HMMs or when the
cardinality of $\mathcal{X}$ is too large using computational methods such as Sequential Monte Carlo methods (SMC), see
[8, 12] for a review of these methods. In this context, the Forward Filtering Backward Smoothing [21, 19, 13]
and Forward Filtering Backward Simulation [17] algorithms have been intensively studied, with the
objective of quantifying the error made when the filtering and marginal smoothing distributions are replaced by
their Monte Carlo approximations. These algorithms and some extensions have been analyzed theoretically 
recently, see for instance [9, 10, 14, 23]. SMC methods may also be used in algorithms when the parameters 
of the HMM are unknown to perform maximum likelihood parameter estimation, see [20] for on-line and 
off-line Expectation Maximization and gradient ascent based algorithms. Part of our analysis of the filtering 
and smoothing distributions is based on the same approach as in those papers and requires sharp forgetting 
properties of HMMs. 

- Second, we extend spectral methods to a nonparametric setting and give explicit control of the $L^2$-risk of the
estimators. Such estimators may then be used in the computation of posterior distributions. In latent variable 
models such as HMMs, spectral methods are popular since they lead to algorithms that are not sensitive to 
a chosen initial estimate. Indeed, standard estimation methods for HMMs are based on the Expectation- 
Maximization (EM) algorithm, which faces intrinsic limitations hard to circumvent such as slow convergence 
and suboptimal local optima. Extending spectral methods to nonparametric HMMs is thus very useful. In 
particular, they may be used to provide a preliminary estimator as starting point in an EM algorithm. They are
also used in a refinement procedure proposed in [7]. To the best of our knowledge, the spectral method has
not been extended nor studied yet in the nonparametric framework. 

We start from the works of Anandkumar, Hsu, Kakade and Zhang on spectral methods in the parametric 
frame. Their papers [18, 3] present an efficient algorithm for learning parametric HMMs or more generally 
finitely many linear functionals of the parameters of the HMM. Thus, it is possible to use spectral methods 
to estimate the projections of the emission distributions onto nested subspaces of increasing complexity. Our 
work brings a new quantitative insight on the tradeoff between sampling size and approximation complexity 
for spectral estimators. We provide a nonasymptotic precise upper bound of the risk for the variance term 
with respect to the number of observations and the complexity of the approximating subspace. 

1.3 Outline of the paper 

In Section 2, we provide an explicit control of the total variation filtering and smoothing errors as a function
of the parameter estimation error, see Propositions 2.1 and 2.2. We detail the application of these preliminary 
results in the parametric context, see Theorem 2.3, and in the nonparametric context, see Theorem 2.4 where 
we prove that the uniform rate of convergence for the filtering and smoothing errors is driven by the $L^1$-risk
of the nonparametric estimator of the emission distributions. In Section 3, we explain how spectral methods 
can be extended to the nonparametric frame and we provide the nonasymptotic control of the variance term 
in Theorem 3.1. This leads to the asymptotic behavior proved in Corollary 3.2, which may be invoked when 
spectral methods are used in the computation of posterior distributions, see Corollary 2.5 stated in the previous 
section. Finally, in Section 4 we show the performance of the spectral method with simulated data for both the 
estimation of the (nonparametric) conditional distribution of the observations and the estimation of the marginal 
smoothing distributions. All detailed proofs are given in the appendices. 

2 Main results 

2.1 Notations and setting 

In the sequel, it is assumed that the cardinality $K$ of $\mathcal{X}$ is known (for ease of notation, $\mathcal{X}$ is set to be
$\{1, \dots, K\}$) and that $\mathcal{Y}$ is a subset of $\mathbb{R}^D$ for a positive integer $D$. Denote by
$\mathcal{P}(\mathcal{X})$ the space of probability measures on $\mathcal{X}$ and write $\mathcal{L}^D$ the Lebesgue measure
on $\mathcal{Y}$. For all $n \ge 1$ and all $x \in \mathcal{X}$, the density of the conditional distribution of $Y_n$ given
$X_n = x$ with respect to $\mathcal{L}^D$ is written $f^\star_x$. Consider the following assumptions on the
hidden chain.

[H1] a) The transition matrix $Q^\star$ has full rank,
b) $\delta_\star := \min_{x, x' \in \mathcal{X}} Q^\star(x, x') > 0$.

[H2] The initial distribution $\pi^\star := (\pi^\star_1, \dots, \pi^\star_K)$ is the stationary distribution.


Remark 2.1. Note that under [H1]-b) and [H2], for all $k \in \mathcal{X}$, $\pi^\star_k \ge \delta_\star > 0$.

Remark 2.2. Assumptions [H1]-a) and [H2] appear in spectral methods, see for instance [3, 18], and in
identifiability issues, see for instance [1, 2, 16].

For all $y \in \mathcal{Y}$, define $c_\star(y)$ by

$$c_\star(y) := \min_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X}} Q^\star(x, x') f^\star_{x'}(y). \tag{1}$$

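For all $y_{1:n} \in \mathcal{Y}^n$, the filtering distributions $\phi^\star_k(\cdot, y_{1:k})$ and marginal smoothing
distributions $\phi^\star_{k|n}(\cdot, y_{1:n})$ may be computed explicitly for all $1 \le k \le n$ using the
forward-backward algorithm of [5]. In the forward pass, the filtering distributions $\phi^\star_k$ are updated
recursively using, for all $x \in \mathcal{X}$,

$$\phi^\star_1(x, y_1) := \frac{\pi^\star(x) f^\star_x(y_1)}{\sum_{x' \in \mathcal{X}} \pi^\star(x') f^\star_{x'}(y_1)}
\quad \text{and} \quad
\phi^\star_k(x, y_{1:k}) := \frac{\sum_{x' \in \mathcal{X}} Q^\star(x', x) f^\star_x(y_k) \phi^\star_{k-1}(x', y_{1:k-1})}{\sum_{x', x'' \in \mathcal{X}} Q^\star(x', x'') f^\star_{x''}(y_k) \phi^\star_{k-1}(x', y_{1:k-1})}. \tag{2}$$

Note that for all $1 \le k \le n$, $\phi^\star_k(x, Y_{1:k}) = \mathbb{P}(X_k = x \mid Y_{1:k})$. In the backward pass, the
marginal smoothing distributions may be updated recursively using, for all $x \in \mathcal{X}$,

$$\phi^\star_{n|n}(x, y_{1:n}) := \phi^\star_n(x, y_{1:n})
\quad \text{and} \quad
\phi^\star_{k|n}(x, y_{1:n}) := \sum_{x' \in \mathcal{X}} B^\star_{\phi^\star_k(\cdot, y_{1:k})}(x', x)\, \phi^\star_{k+1|n}(x', y_{1:n}), \tag{3}$$

where, for all $u, v \in \mathcal{X}$ and all $1 \le k < n$,

$$B^\star_{\phi^\star_k(\cdot, y_{1:k})}(u, v) := \frac{Q^\star(v, u) \phi^\star_k(v, y_{1:k})}{\sum_{z \in \mathcal{X}} Q^\star(z, u) \phi^\star_k(z, y_{1:k})}.$$

Note that for all $1 \le k \le n$, $\phi^\star_{k|n}(x, Y_{1:n}) = \mathbb{P}(X_k = x \mid Y_{1:n})$.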
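
The forward and backward passes can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: it assumes the emission densities have already been evaluated at the observations and stored in an array `emis` (a hypothetical name), and it uses the standard forward filtering recursion together with the backward kernel built from the filter.

```python
import numpy as np

def forward_backward(pi, Q, emis):
    """Filtering and marginal smoothing for a finite state space HMM.

    pi   : (K,) initial distribution
    Q    : (K, K) transition matrix, Q[x, x'] = P(X_{k+1} = x' | X_k = x)
    emis : (n, K) array, emis[k, x] = f_x(y_{k+1}) (assumed precomputed)
    Returns (filt, smooth), both (n, K); row k is the distribution at time k + 1.
    """
    n, K = emis.shape
    filt = np.empty((n, K))
    # Forward pass: normalised prediction-correction step.
    w = pi * emis[0]
    filt[0] = w / w.sum()
    for k in range(1, n):
        w = (filt[k - 1] @ Q) * emis[k]
        filt[k] = w / w.sum()
    # Backward pass: smooth[k] = smooth[k + 1] @ back, with the backward kernel
    # back[u, v] = Q[v, u] * filt[k, v] / sum_z Q[z, u] * filt[k, z].
    smooth = np.empty((n, K))
    smooth[-1] = filt[-1]
    for k in range(n - 2, -1, -1):
        back = (Q * filt[k][:, None]).T
        back /= back.sum(axis=1, keepdims=True)
        smooth[k] = smooth[k + 1] @ back
    return filt, smooth
```

Plugging estimators of the initial distribution, the transition matrix and the emission densities into the same function yields the approximate filtering and smoothing distributions studied in the remainder of this section.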


2.2 Preliminary results 

In this paper, the parameters $\pi^\star$, $Q^\star$ and $f^\star$ are unknown. Then, the recursive equations (2) and (3) may be
applied replacing $\pi^\star$, $Q^\star$ and $f^\star$ by some estimators $\hat\pi$, $\hat Q$ and $\hat f$ to obtain approximations of the
filtering and smoothing distributions. Using forgetting properties of the hidden chain, we are able to obtain an upper bound of
the filtering errors and of the marginal smoothing errors by terms involving only the estimation errors of $\pi^\star$, $Q^\star$
and $f^\star$. These upper bounds are given in Propositions 2.1 and 2.2. Their proofs are postponed to Appendices A
and B. Note that the upper bounds are given for any possible values $y_{1:k}$, $k \ge 1$, and may be applied to the set
of observations for which filtering and smoothing distributions are estimated, whatever the set of observations
used to estimate $\pi^\star$, $Q^\star$ and $f^\star$. Let $\|\cdot\|_{\mathrm{tv}}$ be the total variation norm, $\|\cdot\|_2$ the Euclidean norm and
$\|\cdot\|_F$ the Frobenius norm. For all $1 \le k \le n$, denote by $\hat\phi_k$ and $\hat\phi_{k|n}$ the approximations of
$\phi^\star_k$ and $\phi^\star_{k|n}$ obtained by replacing $\pi^\star$, $Q^\star$ and $f^\star$ by the estimators $\hat\pi$, $\hat Q$ and $\hat f$ in (2) and (3).

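Proposition 2.1. Assume [H1]-b) and [H2] hold. Then, for all $k \ge 1$ and all $y_{1:k} \in \mathcal{Y}^k$,

$$\|\phi^\star_k(\cdot, y_{1:k}) - \hat\phi_k(\cdot, y_{1:k})\|_{\mathrm{tv}} \le C_\star \Big( \rho_\star^{k-1} \|\pi^\star - \hat\pi\|_2 / \delta_\star + \|Q^\star - \hat Q\|_F / (\delta_\star (1 - \rho_\star)) + \sum_{\ell=1}^{k} \rho_\star^{k-\ell} c_\star^{-1}(y_\ell) \max_{x \in \mathcal{X}} \big| f^\star_x(y_\ell) - \hat f_x(y_\ell) \big| \Big),$$

where $\rho_\star := 1 - \delta_\star / (1 - \delta_\star)$ and $C_\star := 4 (1 - \delta_\star) / \delta_\star$.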
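
To make the structure of this bound concrete, the sketch below evaluates its right-hand side for given parameters and estimators. It is an illustration under stated assumptions, not part of the paper: the function name and the arrays `f_vals` and `f_hat_vals` (true and estimated emission densities evaluated at $y_1, \dots, y_k$) are hypothetical, and the constants are the ones defined in Proposition 2.1 and in (1).

```python
import numpy as np

def filtering_bound(Q, Q_hat, pi, pi_hat, f_vals, f_hat_vals):
    """Right-hand side of the filtering bound at time k = f_vals.shape[0].

    f_vals, f_hat_vals : (k, K) arrays; row l - 1 holds the true (resp. estimated)
    emission densities of all K states evaluated at y_l.
    """
    delta = Q.min()                        # delta_star = min_{x,x'} Q*(x, x')
    rho = 1.0 - delta / (1.0 - delta)      # rho_star
    C = 4.0 * (1.0 - delta) / delta        # C_star
    k = f_vals.shape[0]
    # c_star(y_l) = min_x sum_{x'} Q*(x, x') f*_{x'}(y_l)
    c = (f_vals @ Q.T).min(axis=1)
    emis_err = np.abs(f_vals - f_hat_vals).max(axis=1)
    weights = rho ** (k - 1 - np.arange(k))    # rho_star^(k - l) for l = 1, ..., k
    return C * (
        rho ** (k - 1) * np.linalg.norm(pi - pi_hat) / delta
        + np.linalg.norm(Q - Q_hat, "fro") / (delta * (1.0 - rho))
        + np.sum(weights * emis_err / c)
    )
```

The geometric weights show why the control is uniform in time: an emission error made at a distant observation $y_\ell$ is discounted by $\rho_\star^{k-\ell}$.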

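Proposition 2.2. Assume [H1]-b) and [H2] hold. Then, for all $1 \le k \le n$ and all $y_{1:n} \in \mathcal{Y}^n$,

$$\|\phi^\star_{k|n}(\cdot, y_{1:n}) - \hat\phi_{k|n}(\cdot, y_{1:n})\|_{\mathrm{tv}} \le C_\star \Big( \rho_\star^{k-1} \|\pi^\star - \hat\pi\|_2 / \delta_\star + \big[ 1/(1 - \rho_\star) + 1/(1 - \hat\rho) \big] \|Q^\star - \hat Q\|_F / \delta_\star + \sum_{\ell=1}^{n} (\hat\rho \vee \rho_\star)^{|\ell - k|} c_\star^{-1}(y_\ell) \max_{x \in \mathcal{X}} \big| f^\star_x(y_\ell) - \hat f_x(y_\ell) \big| \Big),$$

where $\hat\delta := \min_{x, x' \in \mathcal{X}} \hat Q(x, x')$ and $\hat\rho := 1 - \hat\delta / (1 - \hat\delta)$.
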
2.3 Uniform consistency of the posterior distributions

Propositions 2.1 and 2.2 are preliminary results that can be used to understand how the estimation errors made
on the parameters of the HMM propagate to the filtering and smoothing distributions. We assume that we
are given a set of $p + n$ observations from the hidden Markov model driven by $\pi^\star$, $Q^\star$ and $f^\star$. The first $p$
observations are used to produce the estimators $\hat\pi$, $\hat Q$ and $\hat f$ while filtering and smoothing are performed with
the last $n$ observations. In other words, the estimators $\hat\pi$, $\hat Q$ and $\hat f$ are measurable functions of $Y_{1:p}$ and the
objective is to estimate $\phi^\star_k(\cdot, Y_{p+1:p+k})$ and $\phi^\star_{k|n}(\cdot, Y_{p+1:p+n})$.

2.3.1 Parametric models 

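In the parametric case, the hidden Markov model depends on a parameter $\theta^\star$ which lies in a subset of $\mathbb{R}^q$ for
a given $q \ge 1$. In this situation, $\theta^\star$ may be estimated by $\hat\theta \in \mathbb{R}^q$ and we may write
$\hat\pi := \pi_{\hat\theta}$, $\hat Q := Q_{\hat\theta}$ and $\hat f := f_{\hat\theta}$.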

Theorem 2.3. Assume [H1] and [H2] hold. Assume also that for all $x, x' \in \mathcal{X}$, $\theta \mapsto Q_\theta(x, x')$ is continuously
differentiable with a bounded derivative in the neighborhood of $\theta^\star$ and that for all $x \in \mathcal{X}$ and all $y \in \mathcal{Y}$,
$\theta \mapsto f_{\theta,x}(y)$ is continuously differentiable in the neighborhood of $\theta^\star$ and such that the norm of its gradient is
upper bounded in this neighborhood by a function $h_x$ such that $\int h_x(y) \, d\mathcal{L}^D(y) < +\infty$. Let $\hat\theta$ be a consistent
estimator of $\theta^\star$. Then for any $1 \le k \le n$,

$$\|\phi^\star_k(\cdot, Y_{p+1:p+k}) - \hat\phi_k(\cdot, Y_{p+1:p+k})\|_{\mathrm{tv}} = O_{\mathbb{P}}(\|\hat\theta - \theta^\star\|_2)$$

and

$$\|\phi^\star_{k|n}(\cdot, Y_{p+1:p+n}) - \hat\phi_{k|n}(\cdot, Y_{p+1:p+n})\|_{\mathrm{tv}} = O_{\mathbb{P}}(\|\hat\theta - \theta^\star\|_2).$$

The smoothness assumption in Theorem 2.3 is usual to study the asymptotic distribution of the maximum
likelihood estimator in parametric HMMs. By Theorem 2.3, tight bounds on the uniform convergence rate
of $\|\phi^\star_k(\cdot, Y_{p+1:p+k}) - \hat\phi_k(\cdot, Y_{p+1:p+k})\|_{\mathrm{tv}}$ and of
$\|\phi^\star_{k|n}(\cdot, Y_{p+1:p+n}) - \hat\phi_{k|n}(\cdot, Y_{p+1:p+n})\|_{\mathrm{tv}}$ may be derived
by controlling the estimation error $\|\hat\theta - \theta^\star\|_2$. There exist several results on this error term depending on the
algorithm used to obtain $\hat\theta$. For instance, [27] provides explicit upper bounds for this error term in the case
where $\hat\theta$ is a recursive maximum likelihood estimator of $\theta^\star$, under additional assumptions on the model.

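Proof. First, under [H1] and [H2], the assumption on $\theta \mapsto Q_\theta(x, x')$ implies that $\theta \mapsto \pi_\theta$ is continuously
differentiable with a bounded derivative in the neighborhood of $\theta^\star$. Notice also that $\sup_{k \ge 1} \rho_\star^{k-1} \le 1$ and
$\sup_{k \ge 1} \hat\rho^{k-1} \le 1$. Then using a Taylor expansion we easily get that the first two terms of the upper bound in
Propositions 2.1 and 2.2 are $O_{\mathbb{P}}(\|\hat\theta - \theta^\star\|_2)$. There just remains to control the last term for each of the upper
bounds in Propositions 2.1 and 2.2. Using a Taylor expansion, the Cauchy-Schwarz inequality, and Proposition 2.1,
we get that for any $1 \le k \le n$,

$$\|\phi^\star_k(\cdot, Y_{p+1:p+k}) - \hat\phi_k(\cdot, Y_{p+1:p+k})\|_{\mathrm{tv}} \le O_{\mathbb{P}}(\|\hat\theta - \theta^\star\|_2) + \|\hat\theta - \theta^\star\|_2 \sum_{\ell=1}^{k} \rho_\star^{k-\ell} c_\star^{-1}(Y_{p+\ell}) \sum_{x \in \mathcal{X}} h_x(Y_{p+\ell}).$$

As the $(Y_j)_{j \ge 1}$ are stationary with distribution having density $\sum_{x \in \mathcal{X}} \pi^\star_x f^\star_x(y) \le c_\star(y) / \delta_\star$, the random variable
$\sum_{\ell=1}^{k} \rho_\star^{k-\ell} c_\star^{-1}(Y_{p+\ell}) \sum_{x \in \mathcal{X}} h_x(Y_{p+\ell})$ is nonnegative and has expectation upper bounded by

$$\frac{1}{\delta_\star} \sum_{\ell=1}^{k} \rho_\star^{k-\ell} \sum_{x \in \mathcal{X}} \int h_x(y) \, d\mathcal{L}^D(y) \le \frac{1}{\delta_\star (1 - \rho_\star)} \sum_{x \in \mathcal{X}} \int h_x(y) \, d\mathcal{L}^D(y) < +\infty.$$

Thus $\sum_{\ell=1}^{k} \rho_\star^{k-\ell} c_\star^{-1}(Y_{p+\ell}) \sum_{x \in \mathcal{X}} h_x(Y_{p+\ell}) = O_{\mathbb{P}}(1)$, so that we get the first point of Theorem 2.3. The result
for the smoothing distributions follows the same lines after noticing that, for some $\epsilon > 0$ such that $\rho_\star + \epsilon < 1$,
the event $\{\hat\rho > \rho_\star + \epsilon\}$ has probability tending to 0 as $p$ tends to infinity when $\hat\theta$ is a consistent estimator of
$\theta^\star$. □
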
2.3.2 Nonparametric models


We first state a general theorem that gives a control of the uniform consistency of the posterior distributions in 
terms of the risk of the nonparametric estimators. The theorem also holds in the parametric context, however, the 
parametric literature usually studies the distributional properties of the estimators, while the nonparametric one 
studies mostly the risk. As usual in the hidden Markov model literature, the model parameters are identifiable 
up to permutations of the hidden states labels. Therefore, without loss of generality, the following results are 
stated indicating the prospective permutation of the states. Let $\mathcal{S}_K$ be the set of permutations of $\{1, \dots, K\}$. If
$\tau$ is a permutation, let $P_\tau$ be the permutation matrix associated with $\tau$.

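Theorem 2.4. Assume [H1]-b) and [H2] hold. Then for all $n \ge 1$, for any permutation $\tau_p \in \mathcal{S}_K$,

$$\sup_{1 \le k \le n} \mathbb{E}\big[ \|\phi^\star_k(\cdot, Y_{p+1:p+k}) - \hat\phi_k(\cdot, Y_{p+1:p+k})\|_{\mathrm{tv}} \big]
\le C_\star \Big\{ \frac{\mathbb{E}\big[\|\pi^\star - P_{\tau_p} \hat\pi\|_2\big]}{\delta_\star} + \frac{\mathbb{E}\big[\|Q^\star - P_{\tau_p} \hat Q P_{\tau_p}^\top\|_F\big]}{\delta_\star (1 - \rho_\star)} + \frac{1 - \delta_\star}{\delta_\star (1 - \rho_\star)} \sum_{x \in \mathcal{X}} \mathbb{E}\big[\|f^\star_x - \hat f_{\tau_p(x)}\|_1\big] \Big\}$$

and

$$\sup_{1 \le k \le n} \mathbb{E}\big[ \|\phi^\star_{k|n}(\cdot, Y_{p+1:p+n}) - \hat\phi_{k|n}(\cdot, Y_{p+1:p+n})\|_{\mathrm{tv}} \big]
\le C_\star \Big\{ \frac{\mathbb{E}\big[\|\pi^\star - P_{\tau_p} \hat\pi\|_2\big]}{\delta_\star} + \mathbb{E}\Big[ \Big( \frac{1}{1 - \rho_\star} + \frac{1}{1 - \hat\rho} \Big) \frac{\|Q^\star - P_{\tau_p} \hat Q P_{\tau_p}^\top\|_F}{\delta_\star} \Big] + \frac{1 - \delta_\star}{\delta_\star} \sum_{x \in \mathcal{X}} \mathbb{E}\Big[ \frac{2 \, \|f^\star_x - \hat f_{\tau_p(x)}\|_1}{1 - \hat\rho \vee \rho_\star} \Big] \Big\}.$$

Here, $\hat\phi_k$ and $\hat\phi_{k|n}$ are the estimations of $\phi^\star_k$ and $\phi^\star_{k|n}$ based on $P_{\tau_p} \hat Q P_{\tau_p}^\top$, $P_{\tau_p} \hat\pi$ and $\hat f_{\tau_p(x)}$, for all $x \in \mathcal{X}$.
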
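Proof. For any $x \in \mathcal{X}$ and any $1 \le \ell \le n$,

$$\mathbb{E}\Big[ c_\star^{-1}(Y_{p+\ell}) \big| f^\star_x(Y_{p+\ell}) - \hat f_{\tau_p(x)}(Y_{p+\ell}) \big| \Big]
= \mathbb{E}\Big[ \mathbb{E}\Big[ c_\star^{-1}(Y_{p+\ell}) \big| f^\star_x(Y_{p+\ell}) - \hat f_{\tau_p(x)}(Y_{p+\ell}) \big| \,\Big|\, Y_{1:p+\ell-1} \Big] \Big],$$

with

$$\mathbb{E}\Big[ c_\star^{-1}(Y_{p+\ell}) \big| f^\star_x(Y_{p+\ell}) - \hat f_{\tau_p(x)}(Y_{p+\ell}) \big| \,\Big|\, Y_{1:p+\ell-1} \Big]
= \int \big| f^\star_x(z) - \hat f_{\tau_p(x)}(z) \big| \, c_\star^{-1}(z) g_\ell(z) \, dz,$$

where $g_\ell(z) := \sum_{x_{\ell-1}, x_\ell \in \mathcal{X}} \phi^\star_{\ell-1}(x_{\ell-1}, Y_{p+1:p+\ell-1}) Q^\star(x_{\ell-1}, x_\ell) f^\star_{x_\ell}(z)$. By [H1]-b) and (1),
$c_\star^{-1}(z) g_\ell(z) \le (1 - \delta_\star) / \delta_\star$ and

$$\mathbb{E}\Big[ c_\star^{-1}(Y_{p+\ell}) \big| f^\star_x(Y_{p+\ell}) - \hat f_{\tau_p(x)}(Y_{p+\ell}) \big| \,\Big|\, Y_{1:p+\ell-1} \Big]
\le (1 - \delta_\star) \|f^\star_x - \hat f_{\tau_p(x)}\|_1 / \delta_\star.$$

Therefore, the result for the filtering distributions comes from taking the supremum and then the expectation in
the upper bound of Proposition 2.1. The proof for the smoothing distributions follows the same steps. □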

What comes out in Theorem 2.4 is a control driven by the $L^1$-risk of the emission densities. In Section 3,
we propose a spectral method to obtain, in the nonparametric context, estimators of the transition matrix, the
stationary distribution and the emission densities. The general idea is that of projection methods, so that at the
end we obtain a control on the $L^2$-risk of the emission densities. This control can be easily transferred whenever
$\mathcal{Y}$ is a compact subset of $\mathbb{R}^D$, since in such a case, for some $C(\mathcal{Y}) > 0$ we have, for any square integrable
functions $h_1$ and $h_2$,

$$\|h_1 - h_2\|_1 \le C(\mathcal{Y}) \|h_1 - h_2\|_2. \tag{4}$$
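
As a short worked step, inequality (4) can be obtained from the Cauchy-Schwarz inequality. The explicit constant below, $C(\mathcal{Y}) = \mathcal{L}^D(\mathcal{Y})^{1/2}$, is one admissible choice given here for illustration (it is finite because a compact set has finite Lebesgue measure); the argument in the text only needs some finite constant.

```latex
\|h_1 - h_2\|_1
  = \int_{\mathcal{Y}} |h_1 - h_2| \cdot 1 \, d\mathcal{L}^D
  \le \Big( \int_{\mathcal{Y}} |h_1 - h_2|^2 \, d\mathcal{L}^D \Big)^{1/2}
      \Big( \int_{\mathcal{Y}} 1 \, d\mathcal{L}^D \Big)^{1/2}
  = \mathcal{L}^D(\mathcal{Y})^{1/2} \, \|h_1 - h_2\|_2 .
```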

We end this section by stating the result that follows when using the spectral estimators. Let $(M_r)_{r \ge 1}$ be an
increasing sequence of integers, and let $(\mathfrak{P}_{M_r})_{r \ge 1}$ be a sequence of nested subspaces such that their union
is dense in $L^2(\mathcal{Y}, \mathcal{L}^D)$. Let $\Phi_{M_r} := \{\varphi_1, \dots, \varphi_{M_r}\}$ be an orthonormal basis of $\mathfrak{P}_{M_r}$. Note that for all
$f \in L^2(\mathcal{Y}, \mathcal{L}^D)$,

$$\lim_{r \to \infty} \sum_{m=1}^{M_r} \langle f, \varphi_m \rangle \varphi_m = f, \tag{5}$$

in $L^2(\mathcal{Y}, \mathcal{L}^D)$. Note also that changing $M_r$ may change all functions $\varphi_m$, $1 \le m \le M_r$, in the basis $\Phi_{M_r}$, which
will not be indicated in the notation for better clarity. We shall also drop the index $r$ and write $M$ instead of $M_r$.
The spectral estimators of the emission densities will be projection estimators. Let us denote $f^\star_{M,1}, \dots, f^\star_{M,K}$
the projections of the emission densities on the space $\mathfrak{P}_M$, that is, for $x \in \mathcal{X}$,

$$f^\star_{M,x} = \sum_{m=1}^{M} \langle f^\star_x, \varphi_m \rangle \varphi_m.$$

We need a further assumption, which, together with [H1]-b) and [H2], has been proved sufficient to get
identifiability in nonparametric HMMs, see [16].

[H3] The family of emission densities $\mathcal{F}^\star := \{f^\star_1, \dots, f^\star_K\}$ is linearly independent.

Finally, the following quantity is needed in the control of the $L^2$-risk of the spectral estimators. For any $M$,
define

$$\eta_3^2(\Phi_M) := \sup_{y, y' \in \mathcal{Y}^3} \sum_{a,b,c=1}^{M} \big( \varphi_a(y_1) \varphi_b(y_2) \varphi_c(y_3) - \varphi_a(y'_1) \varphi_b(y'_2) \varphi_c(y'_3) \big)^2. \tag{6}$$

Applying Theorem 2.4 and (4) we get the following corollary whose proof is omitted: the first point is an 
application of Corollary 3.2, the second point is obtained following the same lines as the proof of Corollary 3.2. 

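Corollary 2.5. Assume [H1]-[H3] hold. Assume also that for all $x \in \mathcal{X}$, $f^\star_x \in L^2(\mathcal{Y}, \mathcal{L}^D)$. Let $M_p$ be a
sequence of integers tending to infinity such that $\eta_3(\Phi_{M_p}) = o(\sqrt{p / \log p})$. For each $p$, define $\hat f$, $\hat Q$ and $\hat\pi$ as
the estimators obtained by the spectral algorithm given in Section 3 with this choice of $M_p$. Then, there exists
a sequence of permutations $\tau_p \in \mathcal{S}_K$ such that

$$\mathbb{E}\Big[ \sup_{k \ge 1} \|\phi^\star_k(\cdot, Y_{p+1:p+k}) - \hat\phi_k(\cdot, Y_{p+1:p+k})\|_{\mathrm{tv}} \Big]
= O\Big( \eta_3(\Phi_{M_p}) \sqrt{\log p / p} + \sum_{x \in \mathcal{X}} \|f^\star_x - f^\star_{M_p,x}\|_2 \Big)$$

and

$$\mathbb{E}\Big[ \sup_{1 \le k \le n} \|\phi^\star_{k|n}(\cdot, Y_{p+1:p+n}) - \hat\phi_{k|n}(\cdot, Y_{p+1:p+n})\|_{\mathrm{tv}} \Big]
= O\Big( \eta_3(\Phi_{M_p}) \sqrt{\log p / p} + \sum_{x \in \mathcal{X}} \|f^\star_x - f^\star_{M_p,x}\|_2 \Big).$$
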


One may consider the following standard examples. 

- (Spline) The space of piecewise polynomials of degree bounded by $d_r$ based on the regular partition with $p_r^D$
regular pieces on $\mathcal{Y}$. It holds that $M_r = (d_r + 1)^D p_r^D$.

- (Trig.) The space of real trigonometric polynomials on $\mathcal{Y}$ with degree less than $r$. It holds that $M_r =
(2r + 1)^D$.

- (Wav.) A wavelet basis $\Phi_{M_r}$ of scale $r$ on $\mathcal{Y}$, see [22]. It holds that $M_r = 2^{(r+1)D}$.

In those examples, there exists a constant $C_\eta > 0$ such that $\eta_3(\Phi_M) \le C_\eta M^{3/2}$, so that the rate of uniform
convergence for the posterior probabilities is $O\big( M_p^{3/2} \sqrt{\log p / p} + \sum_{x \in \mathcal{X}} \|f^\star_x - f^\star_{M_p,x}\|_2 \big)$.
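
To illustrate the tradeoff between the two terms of this rate, here is a short worked computation under an assumption that is not made in the text: suppose the approximation error decays polynomially, $\sum_x \|f^\star_x - f^\star_{M,x}\|_2 \le C_b M^{-s}$, for some hypothetical constants $C_b > 0$ and $s > 0$. Balancing the two terms then suggests the following choice of $M_p$.

```latex
M_p^{3/2} \sqrt{\log p / p} \asymp M_p^{-s}
  \iff M_p^{\,s + 3/2} \asymp (p / \log p)^{1/2}
  \iff M_p \asymp (p / \log p)^{1/(2s + 3)},
\qquad \text{which gives the rate} \qquad
M_p^{-s} \asymp (p / \log p)^{-s/(2s + 3)} .
```

With this choice, $M_p^{3/2} \asymp (p / \log p)^{3/(4s+6)}$, which is $o(\sqrt{p / \log p})$ for every $s > 0$, so the growth condition of Corollary 2.5 is met in the three examples above.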


3 Nonparametric spectral estimation of HMMs 

3.1 Description of the spectral method 

This section describes a tractable approach to get nonparametric estimators of the emission densities and of the 
transition matrix. Our procedure relies on the estimation of the projections of the emission laws onto nested 
subspaces of increasing complexity. This allows to illustrate the uniform consistency result provided in the 
previous section. 

Recall that $(\mathfrak{P}_{M_r})_{r \ge 1}$ is a sequence of nested subspaces of $L^2(\mathcal{Y}, \mathcal{L}^D)$ associated with their orthonormal
bases $(\Phi_{M_r})_{r \ge 1}$. Since projections are linear functionals of the distributions, it is possible to use spectral
methods to estimate the projections of the emission distributions on the basis $\Phi_M$ for each $M$. To this end, our
approach is based on the work described in [3]. In particular, we follow their strategy to get an estimation of the
emission densities. However, the dependency in the dimension is of crucial importance in the nonparametric
framework and it has not been addressed in [3]. Hence, we present in Theorem C.3 a new quantitative version
of the work [3] that accounts for the dimension $M$. Moreover, the authors of [3] invoke a way of estimating
the transition matrix $Q^\star$ but they do not give any theoretical guarantees regarding this estimator. In this paper,
we introduce a slightly different estimator that is based on a surrogate $\tilde\pi$ (see Step 8 of Algorithm 1) of the
stationary distribution. Our estimator (see Step 9 of Algorithm 1) is then built from the "observable" operator
(rather than its left singular vectors as done in [3]). Eventually, Theorem C.2 gives the theoretical guarantees of
our estimator of the transition matrix and its stationary distribution.

The computation of those estimators is particularly simple: it is based on one singular value decomposition,
matrix inversions and one diagonalization. It is proved in Theorems C.2 and C.3 that, with overwhelming
probability, all the matrix inversions and the diagonalization can be done rightfully.

For all $(p \times q)$ matrix $A$ with $p \ge q$, denote by $\sigma_1(A) \ge \sigma_2(A) \ge \dots \ge \sigma_q(A) \ge 0$ its singular values
and $\|\cdot\|$ its operator norm. When $A$ is invertible, let $\kappa(A) := \sigma_1(A) / \sigma_q(A)$ be its condition number. $A^\top$ is
the transpose matrix of $A$, $A(k, \ell)$ its $(k, \ell)$th entry, $A(\cdot, \ell)$ its $\ell$th column and $A(k, \cdot)$ its $k$th row. When
$A$ is a $(p \times p)$ diagonalizable matrix, its eigenvalues are written $\lambda_1(A) \ge \lambda_2(A) \ge \dots \ge \lambda_p(A)$. For any
$1 \le q \le +\infty$, $\|\cdot\|_q$ is the usual $L^q$ norm for vectors. For any row or column vector $v$, denote by $\mathrm{Diag}[v]$ the
diagonal matrix with diagonal entries $v_i$. The following vectors, matrices and tensors are used throughout the
paper:

- $L_M \in \mathbb{R}^M$ is the projection of the distribution of one observation on the basis $\Phi_M$: for all $a \in \{1, \dots, M\}$,
$L_M(a) := \mathbb{E}[\varphi_a(Y_1)]$;

- $N_M \in \mathbb{R}^{M \times M}$ is the joint distribution of two consecutive observations: for all $(a, b) \in \{1, \dots, M\}^2$,
$N_M(a, b) := \mathbb{E}[\varphi_a(Y_1) \varphi_b(Y_2)]$;

- $M_M \in \mathbb{R}^{M \times M \times M}$ is the joint distribution of three consecutive observations: for all $(a, b, c) \in \{1, \dots, M\}^3$,
$M_M(a, b, c) := \mathbb{E}[\varphi_a(Y_1) \varphi_b(Y_2) \varphi_c(Y_3)]$;

- $O_M \in \mathbb{R}^{M \times K}$ is the conditional distribution of one observation on the basis $\Phi_M$: for all $(m, x) \in \{1, \dots, M\} \times \mathcal{X}$,
$O_M(m, x) := \mathbb{E}[\varphi_m(Y_1) \mid X_1 = x] = \langle f^\star_x, \varphi_m \rangle$;

- For all $x \in \mathcal{X}$, $f^\star_{M,x}$ is the projection of the emission laws on the subspace $\mathfrak{P}_M$: $f^\star_{M,x} := \sum_{m=1}^{M} O_M(m, x) \varphi_m$.
Write $f^\star_M := (f^\star_{M,1}, \dots, f^\star_{M,K})$;

- $P_M \in \mathbb{R}^{M \times M}$ is the joint distribution of $(Y_1, Y_3)$: for all $(a, c) \in \{1, \dots, M\}^2$, $P_M(a, c) := \mathbb{E}[\varphi_a(Y_1) \varphi_c(Y_3)]$.
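
The quantities listed above have natural empirical counterparts, obtained by averaging over consecutive observations as in Step 1 of Algorithm 1. The sketch below computes them together with the singular vectors of Step 2. It is an illustration under assumptions that are not from the text: observations are taken in $[0, 1]$, the basis is the orthonormal histogram basis, and all function names are hypothetical.

```python
import numpy as np

def histogram_basis(y, M):
    """Evaluate the M orthonormal histogram functions on [0, 1] at the points y.

    phi_a = sqrt(M) * indicator of the a-th bin. Returns an array of shape (len(y), M).
    """
    bins = np.minimum((np.asarray(y) * M).astype(int), M - 1)
    phi = np.zeros((len(bins), M))
    phi[np.arange(len(bins)), bins] = np.sqrt(M)
    return phi

def empirical_moments(y, M):
    """Empirical versions of L_M, N_M, P_M and M_M from a chain y of length p + 2."""
    phi = histogram_basis(y, M)
    p = len(y) - 2
    a, b, c = phi[:p], phi[1:p + 1], phi[2:p + 2]
    L_hat = a.mean(axis=0)
    N_hat = a.T @ b / p
    P_hat = a.T @ c / p
    M_hat = np.einsum("sa,sb,sc->abc", a, b, c) / p
    return L_hat, N_hat, P_hat, M_hat

def top_right_singular_vectors(P_hat, K):
    """M x K matrix whose columns are the top-K right singular vectors of P_hat."""
    _, _, vt = np.linalg.svd(P_hat)
    return vt[:K].T
```

The remaining steps of the algorithm (simultaneous diagonalization and recovery of the transition matrix) are not covered by this sketch.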


3.2 Variance of the spectral estimators 

This section displays the results which allow to derive the asymptotic properties of the spectral estimators. The
aim of Theorem 3.1 is to provide an upper bound for the variance term with an explicit dependency with respect
to both $p$ and $M$. The way it depends on $M$ is described by the quantity $\eta_3$ defined in (6). Recall that, in the
examples (Spline), (Trig.) and (Wav.), we have $\eta_3(\Phi_M) \le C_\eta M^{3/2}$ with $C_\eta > 0$ a constant. In this section,
assumption [H1] may be replaced by the following weaker assumption [H1'].

[H1'] a) The transition matrix $Q^\star$ has full rank,
b) $(X_n)_{n \ge 1}$ is irreducible and aperiodic.

Note that under [H1'] and [H2], there exists $\pi^\star_{\min} > 0$ such that, for all $x \in \mathcal{X}$,

$$\pi^\star_x \ge \pi^\star_{\min}. \tag{7}$$

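Theorem 3.1 (Spectral estimators). Assume that [H1'] and [H2]-[H3] hold. Assume also that for all $x \in \mathcal{X}$,
$f^\star_x \in L^2(\mathcal{Y}, \mathcal{L}^D)$. Then, there exist positive constants $u(Q^\star)$, $C(Q^\star, \mathcal{F}^\star)$ and $N(Q^\star, \mathcal{F}^\star)$ such that for any
$u \ge u(Q^\star)$, any $\delta \in (0, 1)$, any $M \ge M_{\mathcal{F}^\star}$, there exists a permutation $\tau_M \in \mathcal{S}_K$ such that the spectral
method estimators $\hat f_{M,x}$, $\hat\pi$ and $\hat Q$ (see Algorithm 1) satisfy, for any $p \ge N(Q^\star, \mathcal{F}^\star) \eta_3(\Phi_M)^2 u (-\log \delta) / \delta^2$,
with probability greater than $1 - 2\delta - 4 e^{-u}$,

$$\max_{x \in \mathcal{X}} \|f^\star_{M,x} - \hat f_{M,\tau_M(x)}\|_2 \le C(Q^\star, \mathcal{F}^\star) \frac{\sqrt{-\log \delta}}{\delta} \frac{\eta_3(\Phi_M)}{\sqrt{p}} \sqrt{u},$$

$$\|\pi^\star - P_{\tau_M} \hat\pi\|_2 \le C(Q^\star, \mathcal{F}^\star) \frac{\sqrt{-\log \delta}}{\delta} \frac{\eta_3(\Phi_M)}{\sqrt{p}} \sqrt{u},$$

$$\|Q^\star - P_{\tau_M} \hat Q P_{\tau_M}^\top\| \le C(Q^\star, \mathcal{F}^\star) \frac{\sqrt{-\log \delta}}{\delta} \frac{\eta_3(\Phi_M)}{\sqrt{p}} \sqrt{u}.$$

Algorithm 1: Nonparametric spectral estimation of the transition matrix and the emission laws
Data: An observed chain $(Y_1, \dots, Y_{p+2})$ and a number of hidden states $K$.

Result: Spectral estimators $\hat\pi$, $\hat Q$ and $(\hat f_{M,x})_{x \in \mathcal{X}}$.

[Step 1] For all $a, b, c$ in $\{1, \dots, M\}$, consider the following empirical estimators:
$\hat L_M(a) := \frac{1}{p} \sum_{s=1}^{p} \varphi_a(Y_s)$, $\hat M_M(a, b, c) := \frac{1}{p} \sum_{s=1}^{p} \varphi_a(Y_s) \varphi_b(Y_{s+1}) \varphi_c(Y_{s+2})$,
$\hat N_M(a, b) := \frac{1}{p} \sum_{s=1}^{p} \varphi_a(Y_s) \varphi_b(Y_{s+1})$ and $\hat P_M(a, c) := \frac{1}{p} \sum_{s=1}^{p} \varphi_a(Y_s) \varphi_c(Y_{s+2})$.

[Step 2] Let $\hat U$ be the $M \times K$ matrix of orthonormal right singular vectors of $\hat P_M$ corresponding to its top $K$
singular values.

[Step 3] For all $b \in \{1, \dots, M\}$, set $\hat B(b) := (\hat U^\top \hat P_M \hat U)^{-1} \hat U^\top \hat M_M(\cdot, b, \cdot) \hat U$.

[Step 4] Set $\Theta$ a $(K \times K)$ unitary matrix uniformly drawn and, for all $x \in \mathcal{X}$, $\hat C(x) := \sum_{b=1}^{M} (\hat U \Theta)(b, x) \hat B(b)$.

[Step 5] Compute $\hat R$ a $(K \times K)$ unit Euclidean norm columns matrix that diagonalizes the matrix $\hat C(1)$:
$\hat R^{-1} \hat C(1) \hat R = \mathrm{Diag}[(\hat\Lambda(1, 1), \dots, \hat\Lambda(1, K))]$.

[Step 6] For all $x, x' \in \mathcal{X}$, set $\hat\Lambda(x, x') := (\hat R^{-1} \hat C(x) \hat R)(x', x')$ and $\hat O_M := \hat U \Theta \hat\Lambda$.

[Step 7] Consider the estimator $(\hat f_{M,x})_{x \in \mathcal{X}}$ defined by, for all $x \in \mathcal{X}$, $\hat f_{M,x} := \sum_{m=1}^{M} \hat O_M(m, x) \varphi_m$.

[Step 8] Set $\tilde\pi := (\hat U^\top \hat O_M)^{-1} \hat U^\top \hat L_M$.

[Step 9] Consider the transition matrix estimator
$\hat Q := \Pi_{TM}\big( (\hat U^\top \hat O_M \mathrm{Diag}[\tilde\pi])^{-1} \hat U^\top \hat N_M \hat U (\hat O_M^\top \hat U)^{-1} \big)$,
where $\Pi_{TM}$ denotes the projection (with respect to the scalar product given by the Frobenius norm)
onto the convex set of transition matrices, and define $\hat\pi$ as the stationary distribution of $\hat Q$.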

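Corollary 3.2. Assume that [H1'] and [H2]-[H3] hold. Assume also that for all $x \in \mathcal{X}$, $f^\star_x \in L^2(\mathcal{Y}, \mathcal{L}^D)$. Let
$M_p$ be a sequence of integers tending to infinity and such that $\eta_3(\Phi_{M_p}) = o(\sqrt{p / \log p})$. For each $p$, define
$\hat f$, $\hat Q$ and $\hat\pi$ as the estimators obtained by the spectral algorithm with this choice of $M_p$. Then, there exists a
sequence of permutations $\tau_p \in \mathcal{S}_K$ such that

$$\mathbb{E}\Big[ \max_{x \in \mathcal{X}} \|f^\star_{M_p,x} - \hat f_{M_p,\tau_p(x)}\|_2 \Big] \vee \mathbb{E}\big[ \|Q^\star - P_{\tau_p} \hat Q P_{\tau_p}^\top\| \big] \vee \mathbb{E}\big[ \|\pi^\star - P_{\tau_p} \hat\pi\|_2 \big] = O\big( \eta_3(\Phi_{M_p}) \sqrt{\log p / p} \big) = o(1).$$

Here, the expectations are with respect to the observations and to the random unitary matrix drawn at [Step 4]
of Algorithm 1.

Proof. Apply Theorem 3.1 where, for each $p$, we define $\delta_p$ such that $(-\log \delta_p) / \delta_p^2 := \log p$. $\delta_p$ goes to 0 and
$M_p$ goes to infinity as $p$ tends to infinity, so that for any large enough $p$, $M_p \ge M_{\mathcal{F}^\star}$. Let $\tau_p$ be the permutation
$\tau_{M_p}$ given by Theorem 3.1. Then, for all $p / (N(Q^\star, \mathcal{F}^\star) \eta_3(\Phi_{M_p})^2 \log p) \ge u \ge u(Q^\star)$, with probability
$1 - 4 e^{-u} - 2 \delta_p$,

$$\max_{x \in \mathcal{X}} \|f^\star_{M_p,x} - \hat f_{M_p,\tau_p(x)}\|_2 \vee \|\pi^\star - P_{\tau_p} \hat\pi\|_2 \vee \|Q^\star - P_{\tau_p} \hat Q P_{\tau_p}^\top\| \le C(Q^\star, \mathcal{F}^\star) \eta_3(\Phi_{M_p}) \sqrt{\log p / p} \, \sqrt{u}.$$

It yields

$$\limsup_{p \to +\infty} \mathbb{E}\Big[ \frac{p}{\eta_3(\Phi_{M_p})^2 \log p} \|Q^\star - P_{\tau_p} \hat Q P_{\tau_p}^\top\|^2 \Big]
\le C(Q^\star, \mathcal{F}^\star)^2 \int_0^{+\infty} \limsup_{p \to +\infty} \mathbb{P}\Big( \frac{\sqrt{p}}{C(Q^\star, \mathcal{F}^\star) \eta_3(\Phi_{M_p}) \sqrt{\log p}} \|Q^\star - P_{\tau_p} \hat Q P_{\tau_p}^\top\| \ge \sqrt{u} \Big) du$$

$$\le C(Q^\star, \mathcal{F}^\star)^2 u(Q^\star) + C(Q^\star, \mathcal{F}^\star)^2 \int_0^{+\infty} 4 e^{-u} \, du < +\infty.$$

The proof is similar for the other terms. □

Figure 1: Estimation of emission laws of beta distributions with parameters (2, 5) and (4, 3) using the spectral
method. The projection basis is the histogram basis (left panel) or the trigonometric basis (right panel).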

4 Experimental results 


We have run several numerical experiments to assess the efficiency of our method. We consider $K = 2$ emission
laws of beta distributions with parameters $(2, 5)$ and $(4, 3)$. In all our experiments, the transition matrix $Q^\star$ is
given by

$$Q^\star := \begin{pmatrix} 0.4 & 0.6 \\ 0.8 & 0.2 \end{pmatrix}.$$

We observe a sequence of $n = 6 \times 10^4$ variables $(Y_j)_{j=1}^{n}$. As projection basis, we have considered the histogram
basis or the trigonometric basis. The minimax adaptive procedure described in [7] gives an estimation of $Q^\star$
and of the emission laws. Using the slope heuristic [4], we find that the selected size of the model is $M = 11$
in the histogram case and $M = 13$ in the trigonometric case. Figure 1 presents the adaptive estimation of the
emission laws. From these estimates, we compute an estimation of the marginal smoothing probabilities using
the forward-backward algorithm. The results are presented in Figure 2.
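
For readers who wish to reproduce a comparable data set, the following sketch simulates a stationary chain with the transition matrix and beta emission laws described above. It is not the authors' simulation code: the function name, the random seed and the (much shorter) sample size are assumptions made for illustration.

```python
import numpy as np

def simulate_hmm(n, Q, beta_params, rng):
    """Draw hidden states and observations from a stationary HMM with beta emissions."""
    K = Q.shape[0]
    # Stationary distribution: left eigenvector of Q for the eigenvalue 1.
    vals, vecs = np.linalg.eig(Q.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    pi = pi / pi.sum()
    x = np.empty(n, dtype=int)
    x[0] = rng.choice(K, p=pi)
    for j in range(1, n):
        x[j] = rng.choice(K, p=Q[x[j - 1]])
    params = np.asarray(beta_params, dtype=float)
    # Each observation is drawn from the beta law attached to its hidden state.
    y = rng.beta(params[x, 0], params[x, 1])
    return x, y, pi

Q_star = np.array([[0.4, 0.6], [0.8, 0.2]])
rng = np.random.default_rng(0)   # arbitrary seed, chosen for reproducibility
x, y, pi_star = simulate_hmm(1000, Q_star, [(2, 5), (4, 3)], rng)
```

For this transition matrix the stationary distribution is $(4/7, 3/7)$, so roughly 57% of the simulated observations come from the first emission law.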


A Control of the filtering error - Proof of Proposition 2.1 


Let $y_{1:n} \in \mathcal{Y}^n$. The aim of this section consists in establishing that the total variation error between $\phi^\star_k(\cdot, y_{1:k})$
and its approximations based on $\hat Q$ and $\hat f$ is bounded uniformly in time $k$. Before stating the main result, we
introduce a standard decomposition of the filtering error $\phi^\star_k(\cdot, y_{1:k}) - \hat\phi_k(\cdot, y_{1:k})$. For all $k \ge 1$, let $F^\star_{k,y_k}$ and
$\hat F_{k,y_k}$ be the forward kernel at time $k$ and its approximation, defined, for all $\nu \in \mathcal{P}(\mathcal{X})$, as:

$$F^\star_{k,y_k} \nu(x) := \frac{\sum_{x' \in \mathcal{X}} Q^\star(x', x) f^\star_x(y_k) \nu(x')}{\sum_{x', x'' \in \mathcal{X}} Q^\star(x', x'') f^\star_{x''}(y_k) \nu(x')}$$

and

$$\hat F_{k,y_k} \nu(x) := \frac{\sum_{x' \in \mathcal{X}} \hat Q(x', x) \hat f_x(y_k) \nu(x')}{\sum_{x', x'' \in \mathcal{X}} \hat Q(x', x'') \hat f_{x''}(y_k) \nu(x')}.$$

Figure 2: Marginal smoothing probabilities obtained with the forward-backward algorithm combined with the
spectral method using projection of the emission laws on the histogram basis (top panel) or the trigonometric
basis (bottom panel).

Clearly, for all $y_{1:n} \in \mathcal{Y}^n$ and $2 \le k \le n$, $\phi^\star_k(\cdot, y_{1:k}) = F^\star_{k,y_k} \phi^\star_{k-1}(\cdot, y_{1:k-1})$ and
$\hat\phi_k(\cdot, y_{1:k}) = \hat F_{k,y_k} \hat\phi_{k-1}(\cdot, y_{1:k-1})$.

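The filtering error is usually written as a sum of one step errors. For all $k \ge 2$,

$$\phi^\star_k(\cdot, y_{1:k}) - \hat\phi_k(\cdot, y_{1:k}) = F^\star_{k,y_k} \phi^\star_{k-1}(\cdot, y_{1:k-1}) - \hat F_{k,y_k} \hat\phi_{k-1}(\cdot, y_{1:k-1})
= \sum_{\ell=1}^{k-1} \Delta_{k,\ell}(y_{\ell:k}) + F^\star_{k,y_k} \hat\phi_{k-1}(\cdot, y_{1:k-1}) - \hat F_{k,y_k} \hat\phi_{k-1}(\cdot, y_{1:k-1}), \tag{8}$$

with $F^\star_{1,y_1} \hat\phi_0 = \phi^\star_1(\cdot, y_1)$ and

$$\Delta_{k,\ell}(y_{\ell:k}) := F^\star_{k,y_k} \cdots F^\star_{\ell+1,y_{\ell+1}} F^\star_{\ell,y_\ell} \hat\phi_{\ell-1}(\cdot, y_{1:\ell-1}) - F^\star_{k,y_k} \cdots F^\star_{\ell+1,y_{\ell+1}} \hat\phi_\ell(\cdot, y_{1:\ell}).$$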

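Let $\beta^\star_{\ell|k}[y_{\ell+1:k}]$ and $F^\star_{\ell|k}[y_{\ell:k}]$ be the backward functions and the forward smoothing transition matrix as defined
in [6, Chapter 3],

$$\beta^\star_{\ell|k}[y_{\ell+1:k}](x_\ell) := \sum_{x_{\ell+1:k}} Q^\star(x_\ell, x_{\ell+1}) f^\star_{x_{\ell+1}}(y_{\ell+1}) \cdots Q^\star(x_{k-1}, x_k) f^\star_{x_k}(y_k), \tag{9}$$

$$F^\star_{\ell|k}[y_{\ell:k}](x_{\ell-1}, x_\ell) := \frac{\beta^\star_{\ell|k}[y_{\ell+1:k}](x_\ell) Q^\star(x_{\ell-1}, x_\ell) f^\star_{x_\ell}(y_\ell)}{\sum_{x \in \mathcal{X}} \beta^\star_{\ell|k}[y_{\ell+1:k}](x) Q^\star(x_{\ell-1}, x) f^\star_x(y_\ell)}. \tag{10}$$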

In the sequel, the dependency on the observations may be dropped to simplify notations. By [ 6 , Chapter 4], for 
any probability distribution u, F£ ... F£ + 1 i/ = ^r|fcF^ +1 | fc ... F^ fc , where u^ k cx /3t k is. Therefore, the filtering 
error ( 8 ) is given by: 


k -1 


4>k~^k = Y {ft I 

1=1 


fcFf+i|fc ■ • ■ F^ fe Pz\kFt+i\k 


• • -Ffeife) +F^ fc _i - Fk<i>k-i , ( 11 ) 


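where $\mu^\star_{\ell|k} \propto \beta^\star_{\ell|k} F^\star_\ell \hat\phi_{\ell-1}$ and $\hat\mu_{\ell|k} \propto \beta^\star_{\ell|k} \hat\phi_\ell$. By [H1]-b), the transition matrix $F^\star_{\ell|k}$ can be lower bounded
uniformly in its first component:

$$F^\star_{\ell|k}(x, x') \ge \frac{\delta^\star}{1-\delta^\star}\, \frac{\beta^\star_{\ell|k}[y_{\ell+1:k}](x')\, f^\star_{x'}(y_\ell)}{\sum_{z\in\mathcal{X}} \beta^\star_{\ell|k}[y_{\ell+1:k}](z)\, f^\star_z(y_\ell)}\,.$$

By [6, Chapter 4], this allows to write

$$\left\| \mu^\star_{\ell|k} F^\star_{\ell+1|k} \cdots F^\star_{k|k} - \hat\mu_{\ell|k} F^\star_{\ell+1|k} \cdots F^\star_{k|k} \right\|_{\mathrm{tv}} \le \rho_\star^{k-\ell}\, \left\| \mu^\star_{\ell|k} - \hat\mu_{\ell|k} \right\|_{\mathrm{tv}}\,. \tag{12}$$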


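Eq. (12) is the crucial step to obtain the upper bound for the filtering error stated in Proposition 2.1. By (11)
and (12),

$$\left\| \phi^\star_k - \hat\phi_k \right\|_{\mathrm{tv}} \le \sum_{\ell=1}^{k-1} \rho_\star^{k-\ell}\, \left\| \mu^\star_{\ell|k} - \hat\mu_{\ell|k} \right\|_{\mathrm{tv}} + \left\| F^\star_k \hat\phi_{k-1} - \hat F_k \hat\phi_{k-1} \right\|_{\mathrm{tv}}\,.$$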
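For all $1 \le \ell \le k-1$ and all bounded function $h$ on $\mathcal{X}$, $|\mu^\star_{\ell|k}(h) - \hat\mu_{\ell|k}(h)| \le T_1 + T_2$ where

$$T_1 := \frac{\left| \sum_{x\in\mathcal{X}} \beta^\star_{\ell|k}[y_{\ell+1:k}](x)\, h(x) \left[ F^\star_\ell \hat\phi_{\ell-1}(x) - \hat F_\ell \hat\phi_{\ell-1}(x) \right] \right|}{\sum_{x\in\mathcal{X}} \beta^\star_{\ell|k}[y_{\ell+1:k}](x)\, F^\star_\ell \hat\phi_{\ell-1}(x)}\,,$$

$$T_2 := \frac{\left| \sum_{x\in\mathcal{X}} \beta^\star_{\ell|k}[y_{\ell+1:k}](x)\, h(x)\, \hat F_\ell \hat\phi_{\ell-1}(x) \right| \left| \sum_{x\in\mathcal{X}} \beta^\star_{\ell|k}[y_{\ell+1:k}](x) \left[ F^\star_\ell \hat\phi_{\ell-1}(x) - \hat F_\ell \hat\phi_{\ell-1}(x) \right] \right|}{\sum_{x\in\mathcal{X}} \beta^\star_{\ell|k}[y_{\ell+1:k}](x)\, F^\star_\ell \hat\phi_{\ell-1}(x)\; \sum_{x\in\mathcal{X}} \beta^\star_{\ell|k}[y_{\ell+1:k}](x)\, \hat F_\ell \hat\phi_{\ell-1}(x)}\,.$$

Both $T_1$ and $T_2$ are upper bounded by the same term so that

$$T_1 + T_2 \le 2\, \frac{\|h\|_\infty\, \|\beta^\star_{\ell|k}[y_{\ell+1:k}]\|_\infty}{\min_{x\in\mathcal{X}} \beta^\star_{\ell|k}[y_{\ell+1:k}](x)}\, \left\| F^\star_\ell \hat\phi_{\ell-1} - \hat F_\ell \hat\phi_{\ell-1} \right\|_{\mathrm{tv}}\,.$$

By (9), for all $x \in \mathcal{X}$, $\beta^\star_{\ell|k}[y_{\ell+1:k}](x) \le (1-\delta^\star) \sum_{x_{\ell+1:k}} f^\star_{x_{\ell+1}}(y_{\ell+1}) \cdots Q^\star(x_{k-1}, x_k)\, f^\star_{x_k}(y_k)$ and $\beta^\star_{\ell|k}[y_{\ell+1:k}](x) \ge \delta^\star \sum_{x_{\ell+1:k}} f^\star_{x_{\ell+1}}(y_{\ell+1}) \cdots Q^\star(x_{k-1}, x_k)\, f^\star_{x_k}(y_k)$, showing that

$$T_1 + T_2 \le 2\, \|h\|_\infty\, \frac{1-\delta^\star}{\delta^\star}\, \left\| F^\star_\ell \hat\phi_{\ell-1} - \hat F_\ell \hat\phi_{\ell-1} \right\|_{\mathrm{tv}}\,.$$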


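Now, for all $2 \le \ell \le k$ and all bounded function $h$ on $\mathcal{X}$, $|F^\star_\ell \hat\phi_{\ell-1}(h) - \hat F_\ell \hat\phi_{\ell-1}(h)| \le R_1 + R_2$, where

$$R_1 := \frac{\left| \sum_{x,x'\in\mathcal{X}} \hat\phi_{\ell-1}(x) \left[ Q^\star(x,x') f^\star_{x'}(y_\ell) - \hat Q(x,x') \hat f_{x'}(y_\ell) \right] h(x') \right|}{\sum_{x,x'\in\mathcal{X}} \hat\phi_{\ell-1}(x)\, Q^\star(x,x')\, f^\star_{x'}(y_\ell)}\,,$$

$$R_2 := \frac{\left| \sum_{x,x'\in\mathcal{X}} \hat\phi_{\ell-1}(x)\, \hat Q(x,x')\, \hat f_{x'}(y_\ell)\, h(x') \right|}{\sum_{x,x'\in\mathcal{X}} \hat\phi_{\ell-1}(x)\, \hat Q(x,x')\, \hat f_{x'}(y_\ell)} \cdot \frac{\left| \sum_{x,x'\in\mathcal{X}} \hat\phi_{\ell-1}(x) \left[ Q^\star(x,x') f^\star_{x'}(y_\ell) - \hat Q(x,x') \hat f_{x'}(y_\ell) \right] \right|}{\sum_{x,x'\in\mathcal{X}} \hat\phi_{\ell-1}(x)\, Q^\star(x,x')\, f^\star_{x'}(y_\ell)}\,.$$

The first term satisfies

$$R_1 \le \Big( \sum_{x,x'\in\mathcal{X}} \hat\phi_{\ell-1}(x) Q^\star(x,x') f^\star_{x'}(y_\ell) \Big)^{-1} \sum_{x,x'\in\mathcal{X}} \hat\phi_{\ell-1}(x) \left| Q^\star(x,x') f^\star_{x'}(y_\ell) - \hat Q(x,x') \hat f_{x'}(y_\ell) \right| |h(x')|$$

$$\le \Big( \sum_{x,x'\in\mathcal{X}} \hat\phi_{\ell-1}(x) Q^\star(x,x') f^\star_{x'}(y_\ell) \Big)^{-1} \sum_{x,x'\in\mathcal{X}} \hat\phi_{\ell-1}(x) \left| Q^\star(x,x') - \hat Q(x,x') \right| f^\star_{x'}(y_\ell)\, |h(x')|$$

$$\quad + \Big( \sum_{x,x'\in\mathcal{X}} \hat\phi_{\ell-1}(x) Q^\star(x,x') f^\star_{x'}(y_\ell) \Big)^{-1} \sum_{x,x'\in\mathcal{X}} \hat\phi_{\ell-1}(x)\, \hat Q(x,x') \left| f^\star_{x'}(y_\ell) - \hat f_{x'}(y_\ell) \right| |h(x')|$$

$$\le \|h\|_\infty \left[ \|Q^\star - \hat Q\|_F / \delta^\star + c_\star^{-1}(y_\ell) \max_{x\in\mathcal{X}} \left| f^\star_x(y_\ell) - \hat f_x(y_\ell) \right| \right]\,,$$

where $c_\star$ is defined in (1). The same upper bound holds for $R_2$. In the case $\ell = 1$,

$$\left\| F^\star_1 \hat\phi_0 - \hat\phi_1 \right\|_{\mathrm{tv}} \le \left\| \phi^\star_1 - \hat\phi_1 \right\|_{\mathrm{tv}} \le 2 \left[ \|\pi^\star - \hat\pi\|_2 / \delta^\star + c_\star^{-1}(y_1) \max_{x\in\mathcal{X}} \left| f^\star_x(y_1) - \hat f_x(y_1) \right| \right]\,.$$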

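Therefore, the filtering error is upper bounded as follows:

$$\left\| \phi^\star_k - \hat\phi_k \right\|_{\mathrm{tv}} \le 4 \left( \frac{1-\delta^\star}{\delta^\star} \right) \sum_{\ell=2}^{k} \rho_\star^{k-\ell} \left[ \|Q^\star - \hat Q\|_F / \delta^\star + c_\star^{-1}(y_\ell) \max_{x\in\mathcal{X}} \left| f^\star_x(y_\ell) - \hat f_x(y_\ell) \right| \right]$$

$$\quad + 4 \left( \frac{1-\delta^\star}{\delta^\star} \right) \rho_\star^{k-1} \left[ \|\pi^\star - \hat\pi\|_2 / \delta^\star + c_\star^{-1}(y_1) \max_{x\in\mathcal{X}} \left| f^\star_x(y_1) - \hat f_x(y_1) \right| \right]\,.$$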
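B Control of the marginal smoothing error - Proof of Proposition 2.2

Let $y_{1:n} \in \mathcal{Y}^n$. The aim of this section consists in establishing that the total variation error between $\phi^\star_{k|n}(\cdot, y_{1:n})$
and its approximations based on $\hat Q$ and $\hat f$ is bounded uniformly in time $k$. Before stating the main result, we
display the decomposition of the smoothing error $\phi^\star_{k|n}(\cdot, y_{1:n}) - \hat\phi_{k|n}(\cdot, y_{1:n})$ depicted in [10] and used in [14]
to obtain nonasymptotic upper bounds for the marginal smoothing error when $\phi^\star_{k|n}(\cdot, y_{1:n})$ is approximated
using Sequential Monte Carlo methods. In the sequel, the dependency on the observations may be dropped to
simplify notations. For any bounded function $h$ on $\mathcal{X}^n$, $\phi^\star_{1:n|n}(h)$ can be written, for any $1 \le \ell \le n$,

$$\phi^\star_{1:n|n}(h) = \frac{\phi^\star_{1:\ell|\ell}\left( L^\star_{\ell,n}(\cdot, h) \right)}{\phi^\star_{1:\ell|\ell}\left( L^\star_{\ell,n}(\cdot, \mathbf{1}) \right)}\,,$$

where $\mathbf{1}$ is the constant function which equals 1 and, for all $x_{1:\ell} \in \mathcal{X}^\ell$,

$$L^\star_{\ell,n}(x_{1:\ell}, h) := \sum_{x_{\ell+1:n} \in \mathcal{X}^{n-\ell}}\ \prod_{u=\ell+1}^{n} Q^\star(x_{u-1}, x_u)\, f^\star_{x_u}(y_u)\, h(x_{1:n})\,. \tag{13}$$

As for the filtering error, the smoothing error can be decomposed as a telescopic sum of one step errors:

$$\hat\phi_{1:n|n}(h) - \phi^\star_{1:n|n}(h) = \sum_{\ell=2}^{n} \left( \frac{\hat\phi_{1:\ell|\ell}\left( L^\star_{\ell,n}(\cdot, h) \right)}{\hat\phi_{1:\ell|\ell}\left( L^\star_{\ell,n}(\cdot, \mathbf{1}) \right)} - \frac{\hat\phi_{1:\ell-1|\ell-1}\left( L^\star_{\ell-1,n}(\cdot, h) \right)}{\hat\phi_{1:\ell-1|\ell-1}\left( L^\star_{\ell-1,n}(\cdot, \mathbf{1}) \right)} \right) + \frac{\hat\phi_1\left( L^\star_{1,n}(\cdot, h) \right)}{\hat\phi_1\left( L^\star_{1,n}(\cdot, \mathbf{1}) \right)} - \frac{\phi^\star_1\left( L^\star_{1,n}(\cdot, h) \right)}{\phi^\star_1\left( L^\star_{1,n}(\cdot, \mathbf{1}) \right)}\,. \tag{14}$$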
This smoothing error can be written using filtering distributions only by introducing the following backward 
operators: 


Cl n {x e ,h):= Y x t-i) ■ ■ ■ ( x, 2,xi)L^ n (xv.e, h ), 

Ce. n (x e ,h) ■.= Y B $ l ,_ 1 ( Xe ’ Xe - 1 ')--- B $S X2 ’ Xl ') L * e ’ n ( Xl:e ’ h ' ) ’ 

Bht — 1 


where for all v £ V(X), Ij„ is the backward smoothing kernel given by 

,, Q4x',xMx') 

A ’ T,zex Q*(z,x)u(z) ‘ 

Then, for all 2 < t < n, the one step error at time l is given by 

<j>v.e\e(L} n (-,h)) (j> e (C tin (-,h)) 


£,n{}l) :— 


e-i(£e~i,n(v h)) 


</>i:e\e(L^ n {-, 1 )) (j>v.e\e(Lg n (-,t)) 1,„(-,!)) 


(15) 


This decomposition yields the upper bound for the marginal smoothing error stated in Proposition 2.2.
The result is obtained by applying the decompositions (14) and (15) to a bounded function $h$ on $\mathcal{X}^n$ which
depends on $x_k$ only: for all $(x_1, \ldots, x_n) \in \mathcal{X}^n$, $h(x_1, \ldots, x_n) = h(x_k)$. The one step error given by (15) is
then analyzed separately according to whether $k \ge \ell$ or $k < \ell$.


Case k > £ ^ 

In this case, the function n (-, h) defined in (13) depends on xt only. Therefore, C(. n {x(.. h) = L* ln (xe., h) = 

C\ n (xi, h ). Thus, Ce-i, n {xe-i,h) = J2 Xe &x Q*(^-i, x i)f* t {yi)C* t n { x ^ h ) and the one ste P error given by 
(15) becomes 


M£jn(-’ h )) _ fa-l(T, xe exQ*(£ X e)fxM) C in( X t’ h )) 
M^lni-’ 1 )) ^-i(E x/ ea’Q *(-, x e)fi t (ye)^e >n ( x eA)) 
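Define the measures $\mu^\star_\ell$ and $\hat\mu_\ell$ on $\mathcal{X}$ by $\mu^\star_\ell(x_\ell) := \sum_{x_{\ell-1}\in\mathcal{X}} \hat\phi_{\ell-1}(x_{\ell-1})\, Q^\star(x_{\ell-1}, x_\ell)\, f^\star_{x_\ell}(y_\ell)$ and $\hat\mu_\ell(x_\ell) :=
\sum_{x_{\ell-1}\in\mathcal{X}} \hat\phi_{\ell-1}(x_{\ell-1})\, \hat Q(x_{\ell-1}, x_\ell)\, \hat f_{x_\ell}(y_\ell)$. Then,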


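By [6, Lemma 4.3.23] and [H1]-b), $|\delta_{\ell,n}(h)| \le \rho_\star^{k-\ell}\, (1-\delta^\star)\, \left\| \mu^\star_\ell / \mu^\star_\ell(\mathbf{1}) - \hat\mu_\ell / \hat\mu_\ell(\mathbf{1}) \right\|_{\mathrm{tv}} \|h\|_\infty / \delta^\star$. Following
the same steps as for the proof of Proposition 2.1 yields

$$\left\| \mu^\star_\ell / \mu^\star_\ell(\mathbf{1}) - \hat\mu_\ell / \hat\mu_\ell(\mathbf{1}) \right\|_{\mathrm{tv}} \le 2\, \|Q^\star - \hat Q\|_F / \delta^\star + 2\, c_\star^{-1}(y_\ell) \max_{x\in\mathcal{X}} \left| f^\star_x(y_\ell) - \hat f_x(y_\ell) \right|\,.$$

The term $\hat\phi_1(L^\star_{1,n}(\cdot, h)) / \hat\phi_1(L^\star_{1,n}(\cdot, \mathbf{1})) - \phi^\star_1(L^\star_{1,n}(\cdot, h)) / \phi^\star_1(L^\star_{1,n}(\cdot, \mathbf{1}))$ is dealt with similarly.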

Case k < £ 

In this case, L\ n (xi-j, h) = h(Xk)C Fn (xe,t). Therefore, 


£t,n{ x t,h)= Y B $ t ^( x e, x e-i) ■ ■ ■ B $ 1 ( x 2> x i) h ( Xk )^e,n( x e> 1 ) , 
#1:^-1 

^k:£— 1 


On the other hand, if z^(a^) := x t )f* e (ye)C^ n (xt, 1), 


^ ue(xe)B^ e i ( x (^ x i-i) ■ ■ ■ B f k (xk+i,Xk)h(x k ) . 


Define vt(xt) := <j> t {xi)C* l<n {xt, 1) = <^-i(a^-i)Q(a^_i, a^)/^ {yf)C.} n {x t , 1). Then, the one 

step error given by (15) becomes 


>{h) = 


E ( vt(xi) 

l a?i t 


viixj) 

V^(l) Ke(l) 




0 <- 


^(a^a^-i)... B^ k (xk+i,Xk)h(xk) 


By [ 6 , Lemma 4.3.23] and the fact that, for all (,x , a;') € <T 2 , Q(a:, a/) > <5, 


|5/,n(/»)| < WHooP^ 


M-) 

Vt{t) 


M-) 

V{(t) 


tv 


As for all xi £ X, C\ n (xn, l)/||£J n (-, l)||oo > <5*/ (1 — 5*), following the same steps as for the proof of 
Proposition 2.1 yields 


M-) 

p«( 1 ) 


^(•) 

Ke(l) 




QIIf/<5* + c; 


L (j/^)max f*( ye ) - f x (y e ) 

x$LX 


C Nonparametric spectral estimators 


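Theorem 3.1 follows from the following more precise results proved in this section. The proofs of the intermediate
lemmas require assumptions [H1'] and [H2]-[H3].

Lemma C.1. There exist a constant $0 < \sigma_{K,\mathfrak{F}^\star} < 1$ and a positive integer $M_{\mathfrak{F}^\star}$ such that for all $M \ge M_{\mathfrak{F}^\star}$,

$$\sigma_K(O_M) \ge \sigma_{K,\mathfrak{F}^\star} > 0\,.$$

Proof. By [H3], the $(K \times K)$ Gram matrix defined by $O^{\star\top} O^\star := (\langle f^\star_{x_1}, f^\star_{x_2} \rangle)_{x_1,x_2\in\mathcal{X}}$ is invertible. Let $\varepsilon_{\mathfrak{F}^\star,M}$
be given by:

$$\varepsilon_{\mathfrak{F}^\star,M} := \left\| O_M^\top O_M - O^{\star\top} O^\star \right\| = \left\| \left( \langle f^\star_{M,x_1}, f^\star_{M,x_2} \rangle - \langle f^\star_{x_1}, f^\star_{x_2} \rangle \right)_{x_1,x_2\in\mathcal{X}} \right\|\,. \tag{16}$$

From (5), there exists $M_{\mathfrak{F}^\star} \ge 1$ such that for all $M \ge M_{\mathfrak{F}^\star}$, $\varepsilon_{\mathfrak{F}^\star,M} \le 3\lambda_K(O^{\star\top} O^\star)/4$. By Weyl's inequality
(see Theorem D.1), $\sigma_K^2(O_M) = \lambda_K(O_M^\top O_M) \ge \lambda_K(O^{\star\top} O^\star)/4$. If $\sigma_K(O^\star) := \lambda_K^{1/2}(O^{\star\top} O^\star)$, note that for
all $M \ge M_{\mathfrak{F}^\star}$, $\sigma_K(O_M) \ge \sigma_K(O^\star)/2$, which concludes the proof. □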
Define the pseudo spectral gap $G_{ps}$ of the Markov chain $(X_n)_{n\ge1}$ as

$$G_{ps} := \max_{k\ge1} \left\{ G\left( \mathrm{Diag}[\pi^\star]^{-1} (Q^{\star\top})^k\, \mathrm{Diag}[\pi^\star]\, Q^{\star k} \right) / k \right\}$$

where $G(A)$ denotes the spectral gap of a transition matrix $A$ defined by

$$G(A) := \begin{cases} 1 - \max\{\lambda : \lambda \text{ eigenvalue of } A,\ \lambda \ne 1\} & \text{if eigenvalue 1 has multiplicity 1,} \\ 0 & \text{otherwise.} \end{cases}$$

Note that $G_{ps}$ depends only on the transition matrix $Q^\star$ which is assumed to be aperiodic and irreducible with
unique stationary distribution $\pi^\star$. Perron-Frobenius theorem ensures that the spectral gap $G(A)$ is well defined
and such that $0 \le G(A) \le 2$.

Remark C.1. If $Q^\star$ is aperiodic and irreducible then $G_{ps} > 0$. In this case, there exists $k$ such that $Q^{\star k}$
is positive (entrywise) and so is $A := \mathrm{Diag}[\pi^\star]^{-1} (Q^{\star\top})^k\, \mathrm{Diag}[\pi^\star]\, Q^{\star k}$. Since $A$ is a positive transition matrix,
Perron-Frobenius theorem ensures that its spectral gap is positive.

Remark C.2. If $Q^\star$ is aperiodic, irreducible and reversible then $G_{ps} = G(Q^\star)(2 - G(Q^\star)) > 0$, see [24] and
references therein.

Define the mixing time $T_{mix}$ of the Markov chain $(X_n)_{n\ge1}$ as

$$T_{mix} := \frac{1 + 3\log 2 - \log \pi^\star_{\min}}{G_{ps}}\,.$$

This mixing time has a deeper interpretation in terms of convergence towards the stationary distribution in total
variation norm, see [24] for instance. For any $\delta \in (0,1)$, set

$$C_\star(Q^\star, \delta) := \sqrt{2/G_{ps}} + 2\sqrt{-2\, T_{mix} \log \delta}\,, \tag{17}$$

which is a constant that depends only on $Q^\star$ and $\delta$.
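To make these constants concrete, here is a hedged NumPy sketch that evaluates the pseudo spectral gap, the mixing time and the constant (17) for a small transition matrix. It is an illustration only: the maximum over $k$ is truncated at a hypothetical `k_max` (this loses nothing once `k_max` exceeds $1/G_{ps}$, since each spectral gap here is at most 1), and the function names are not from the paper. The eigenvalues of $\mathrm{Diag}[\pi]^{-1}(Q^\top)^k \mathrm{Diag}[\pi] Q^k$ are obtained as squared singular values of $\mathrm{Diag}[\pi]^{1/2} Q^k \mathrm{Diag}[\pi]^{-1/2}$, to which the matrix is similar.

```python
import numpy as np

def stationary(Q):
    """Stationary distribution: left eigenvector of Q for the eigenvalue 1."""
    w, v = np.linalg.eig(Q.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return pi / pi.sum()

def reversibilized_gap(Q_power, pi, tol=1e-10):
    """Spectral gap of Diag[pi]^{-1} Q_power^T Diag[pi] Q_power."""
    root = np.sqrt(pi)
    S = (root[:, None] * Q_power) / root[None, :]
    lam = np.linalg.svd(S, compute_uv=False) ** 2  # sorted in decreasing order
    # eigenvalue 1 is simple exactly when the second largest eigenvalue is below 1
    return 0.0 if lam[1] > 1.0 - tol else 1.0 - lam[1]

def pseudo_spectral_gap(Q, k_max=50):
    """max over 1 <= k <= k_max of gap_k / k (truncation of the max is an assumption)."""
    pi = stationary(Q)
    best, Q_power = 0.0, np.eye(Q.shape[0])
    for k in range(1, k_max + 1):
        Q_power = Q_power @ Q
        best = max(best, reversibilized_gap(Q_power, pi) / k)
    return best

def mixing_time(Q):
    """T_mix = (1 + 3 log 2 - log pi_min) / G_ps."""
    return (1.0 + 3.0 * np.log(2.0) - np.log(stationary(Q).min())) / pseudo_spectral_gap(Q)

def c_star(Q, delta):
    """Constant (17): sqrt(2 / G_ps) + 2 sqrt(-2 T_mix log(delta))."""
    return np.sqrt(2.0 / pseudo_spectral_gap(Q)) + 2.0 * np.sqrt(-2.0 * mixing_time(Q) * np.log(delta))

Q_rev = np.array([[0.9, 0.1], [0.2, 0.8]])  # two-state chain, hence reversible
gap_rev = pseudo_spectral_gap(Q_rev)
```

For this two-state chain the eigenvalues of `Q_rev` are 1 and 0.7, so $G(Q) = 0.3$ and the value returned agrees with the formula $G(Q)(2 - G(Q)) = 0.51$ of Remark C.2.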

Theorem C.2. Assume that [HI’] and [H2]-[H3] hold. Let 6,5' £ (0,1) then, with probability greater than 
1 — 2 5 — 4 S', there exists a permutation r £ Sk such that the spectral method estimators fxi.x, 7? and Q (see 
Algorithm 1 for a definition) satisfy, for any M > Mg-*, 

- for all p > Ni(Q+, $*, 4>m, 5, S') and all x £ X, 


II Im,x ~ /m,t(x)I|2< Cm(Q*,3* ,5)C*{Q+,5')i) 3 ($m)/\/P , 


(18) 


- for all p > N 2 (Q S'), 

|| Q* — 11^ 3+ <5)C*(Q*, S')t] 3 (<S>m)/\/P , (19) 


- for all p > N 3 (Q S'), 

||7T* - P r 7r|| 2 < ^Af(Q*,S'*,^)C*(Q*,<5 , ) ? 73(^M)/v / P: 


( 20 ) 


where P r is the permutation matrix associated to t, and 


Ni(Q *,r,*M,S,S') := 


IK 

3(J /£5- 


■Cm( Q*, 3*, <5) 2 C*(Q*, S') 2 V3 (<f> M ) 2 , 


N 2 (Q *,r,*M,S,S') := — —V' M (Q+,r,6) 2 C*(Q+,5') 2 m (<S> M ) 2 


min 

4 


N 3 (Q*,3*,$ m ,M') := -t^m(Q*,3V) 2 C*(Q*,<5') 2 %($m ) 2 


'K 


( a q. 
with


Cm(Q+,$*, S) := 


max ||/*|| 2 

x£X 


cr 2 K r TT* nin a K (Q 


1 + 


lb* 


7 r min°‘lf ) ®*^(Q* 2 ) VM. 


13a 2 (Q,)A^/ 2 <c|. , 83 k 6 (Q*)K 5 4-^H/fc 


Z>W(Q*,3V) := 


^min^A'lQ* 2 ) ^ ^min^A'lQ* 2 ) CT 1',S 

U\/FC M (Q^,g*,<5)max ||/*|| 2 + 3 ^f5’ g 


1+ 2 log 


A'" 


2 \ 1/2' 


^ a K,S 


x€X 


M 


v (Q nr* S) . 8 ^{Y u Y 3 )h 

L > m ( Q *, tS ,<5; .—^2-T 2 - 


min 


(Q*, r ,5) + 4\/3^7r* in C M (Q*,5*,<5) + 


5tt* 


II/ ( V 1iV - 3) ||2Vm 


^(Q*,r,5) :=- 


16 ||/ ( 


(n,v 3 )ll 2 


'K 


( A Q* 


)°A,3'* 7r min 


^(Q*,r,<5)+4^7r* in c M (Q*,r^) + 


57T* ■ , 


||/ ( V 1 ,y 3 )H^. 


where Kg* is given in Lemma C.4,for all ( 2 / 1 , y 2 , 2 / 3 ) G y 3 . 


9*{yuV2,y3) ■= 7i*(a:i)Q+(a:i,a:2)Q+(a;2,a;3)/* 1 (yi)/* 2 (y2)/^3(2/3), 

and <j 2 k (Aq, ) is the K-th largest singular value of J (which is positive, see (29)). 

V 1 zv / 

Theorem C.2 is proved using the analysis of [3] to control the $L^2$-error of the estimation based on the
spectral method described in Section 3.1. To use their result in the nonparametric framework, it is essential to
state explicitly how all constants depend on the dimension $M$. We thus need to recast and optimize the results
of [3]. This is done in Theorem C.3, which is proved in Appendix F. Define


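$$\gamma(O_M) := \min_{x_1 \ne x_2} \left\| O_M(\cdot, x_1) - O_M(\cdot, x_2) \right\|_2 \tag{21}$$

and for all $A \in \mathbb{R}^{M\times M\times M}$ and all $B \in \mathbb{R}^{M\times K}$,

$$\|A\|_{\infty,2} := \max_{\|v\|_2 = 1} \Big\| \sum_{b=1}^{M} v_b\, A(\cdot, b, \cdot) \Big\| \quad \text{and} \quad \|B\|_{2,\infty} := \max_{x\in\mathcal{X}} \left\| B(\cdot, x) \right\|_2\,. \tag{22}$$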

Theorem C.3. Let 0 < 6 < 1. Assume that 3||Pm — Pm||< (Jk(Pm) and that 

IMmIIoo^IIPm — Pm | 


8.2K 5 / 2 (K — 1) k2(Q *°" 


A3AK 4 (K - 1) 


c>7(Om)ctr-(Pm) 

^ 4 (Q*Om) 

Sj{Om)<jk (Pm) 


I M m — Mm||oo,2+- 
|Mm — Mm||oo,2 + - 


cta'(Pm) 

M m ||oo, 2 ||Pm — Pm | 
<7a'(Pm) 


< 1 . 


< 1 , 


(23) 

(24) 


then, with probability greater than 1 — 26, the matrix U t PmU is invertible, the random matrix C(l) is 
diagonalisable (see Algorithm 1), and there exists a permutation r G Sk such that for all x G X, 


|Om(. ,x)-Om(. ,r(*))|| 2 < 2 ||P m_ P mII ||Om|| 2 .oo- 


13A' 1/2 


ca'(Pm) 

^ 00 * 0 ^) 4 
cta(Pm) 


I Mm — Mm||oo, 2 + 


|M 


m||oo,2||P m — Pm 


116 AT' 


o"a(Pm) 

ji /21 fjT 2 / 5 )) 1/2 ] " 6 (Q*°L)IIOmII 2 . 

\ ^ J 6^(Om)ctk{Pm) 


Preliminary lemmas 

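Lemma C.4. There exists a constant $\kappa_{\mathfrak{F}^\star}$ that depends only on $\mathfrak{F}^\star$ such that for all $M \ge M_{\mathfrak{F}^\star}$, $\kappa(O_M) \le \kappa_{\mathfrak{F}^\star}$
where $M_{\mathfrak{F}^\star}$ is given in Lemma C.1. For all $M \ge M_{\mathfrak{F}^\star}$, $\kappa(Q^\star O_M^\top) \le \kappa_{\mathfrak{F}^\star}\, \kappa(Q^\star)$.

Proof. Note that $O^{\star\top} O^\star$ is nonsingular. From (5) and (16) we deduce that $O_M^\top O_M$ tends to $O^{\star\top} O^\star$ as $M$ grows
to infinity. This proves the first point. Recall that $\sigma_i(AB) \le \sigma_1(A)\, \sigma_i(B)$ for all $i = 1, \ldots, K$. Applying this
inequality to $A = Q^{\star-1}$ and $B = Q^\star O_M^\top$ yields $\sigma_K(Q^\star)\, \sigma_K(O_M) \le \sigma_K(Q^\star O_M^\top)$. It follows that $\kappa(Q^\star O_M^\top) \le
\kappa(Q^\star)\, \kappa(O_M)$. The second claim follows from the first claim. □

Lemma C.5. For all $M \ge M_{\mathfrak{F}^\star}$, $\gamma(O_M) \ge \sqrt{2}\, \sigma_{K,\mathfrak{F}^\star}$ and $\|O_M\|_{2,\infty} \le \max_{x\in\mathcal{X}} \|f^\star_x\|_2$, where $\gamma(O_M)$ and
$\|O_M\|_{2,\infty}$ are defined in (21) and (22).

Proof. Observe that $\|O_M v\|_2 \ge \sigma_K(O_M)\, \|v\|_2$. With an appropriate choice of $v$ (the difference of two distinct canonical basis vectors, which has Euclidean norm $\sqrt{2}$) and using Lemma C.1 this
proves the first inequality. As $\Phi_M$ is an orthonormal family, $\|O_M(\cdot, x)\|_2 \le \|f^\star_x\|_2$ which proves the second
claim. □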


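Lemma C.6. For all $M \ge 1$,

$$\|M_M\|_{\infty,2} := \max_{\|v\|_2 = 1} \Big\| \sum_{b=1}^{M} v_b\, M_M(\cdot, b, \cdot) \Big\| \le \|g^\star\|_2\,,$$

where $\|\cdot\|_{\infty,2}$ is defined in (22).

Proof. As for all $x \in \mathcal{X}$, $f^\star_x \in L^2(\mathcal{Y}, \mathcal{L}^D)$, $g^\star \in L^2(\mathcal{Y}^3, \mathcal{L}^{D\otimes3})$. Denote by $\langle \cdot, \cdot \rangle_{L^2(\mathcal{Y}^3, \mathcal{L}^{D\otimes3})}$ the inner product
of $L^2(\mathcal{Y}^3, \mathcal{L}^{D\otimes3})$. As $\varphi_{a,b,c}(y_1, y_2, y_3) := \varphi_a(y_1)\varphi_b(y_2)\varphi_c(y_3)$ is an orthonormal family of $L^2(\mathcal{Y}^3, \mathcal{L}^{D\otimes3})$,

$$\|M_M\|_{\infty,2} = \max_{\|v\|_2 = 1} \Big\| \sum_{b=1}^{M} v_b\, M_M(\cdot, b, \cdot) \Big\| \le \max_{\|v\|_2 = 1} \sum_{b=1}^{M} |v_b|\, \|M_M(\cdot, b, \cdot)\|$$

$$\le \Big( \sum_{b=1}^{M} \|M_M(\cdot, b, \cdot)\|^2 \Big)^{1/2} \le \Big( \sum_{b=1}^{M} \|M_M(\cdot, b, \cdot)\|_F^2 \Big)^{1/2}$$

$$= \Big( \sum_{a,b,c=1}^{M} \mathbb{E}\left[ \varphi_a(Y_1)\varphi_b(Y_2)\varphi_c(Y_3) \right]^2 \Big)^{1/2} = \Big( \sum_{a,b,c=1}^{M} \langle g^\star, \varphi_{a,b,c} \rangle^2_{L^2(\mathcal{Y}^3, \mathcal{L}^{D\otimes3})} \Big)^{1/2} \le \|g^\star\|_2\,,$$

using Cauchy-Schwarz inequality. □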


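Lemma C.7. For all $M \ge 1$, $\|\hat M_M - M_M\|_{\infty,2} \le \|\hat M_M - M_M\|_F$, where $\|\cdot\|_{\infty,2}$ is defined in (22).

Proof. For all $M \ge 1$,

$$\|\hat M_M - M_M\|_{\infty,2} = \max_{\|v\|_2 = 1} \Big\| \sum_{b=1}^{M} v_b\, (\hat M_M - M_M)(\cdot, b, \cdot) \Big\| \le \max_{\|v\|_2 = 1} \sum_{b=1}^{M} |v_b|\, \left\| (\hat M_M - M_M)(\cdot, b, \cdot) \right\|$$

$$\le \Big( \sum_{b=1}^{M} \left\| (\hat M_M - M_M)(\cdot, b, \cdot) \right\|^2 \Big)^{1/2} \le \Big( \sum_{b=1}^{M} \left\| (\hat M_M - M_M)(\cdot, b, \cdot) \right\|_F^2 \Big)^{1/2} = \|\hat M_M - M_M\|_F\,,$$

using Cauchy-Schwarz inequality. □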


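Lemma C.8. Under [H1'] and [H2], for all $M \ge 1$, $\sigma_K(P_M) \ge \pi^\star_{\min}\, \sigma_K^2(O_M)\, \sigma_K(Q^{\star2})$. If [H3] holds, then,
for all $M \ge M_{\mathfrak{F}^\star}$,

$$\sigma_K(P_M) \ge \sigma^2_{K,\mathfrak{F}^\star}\, \pi^\star_{\min}\, \sigma_K(Q^{\star2})\,,$$

where $M_{\mathfrak{F}^\star}$ and $\sigma_{K,\mathfrak{F}^\star}$ are defined in Lemma C.1.

Proof. By Lemma F.1 and (7),

$$\sigma_K(P_M) = \sigma_K(U^\top P_M U) = \sigma_K\left( (U^\top O_M)\, \mathrm{Diag}[\pi^\star]\, Q^{\star2}\, (U^\top O_M)^\top \right)$$

$$\ge \sigma_K(U^\top O_M)\, \sigma_K\left( \mathrm{Diag}[\pi^\star]\, Q^{\star2}\, (U^\top O_M)^\top \right)$$

$$= \sigma_K(O_M)\, \sigma_K\left( \mathrm{Diag}[\pi^\star]\, Q^{\star2}\, (U^\top O_M)^\top \right)$$

$$\ge \sigma_K(\mathrm{Diag}[\pi^\star])\, \sigma_K(O_M)\, \sigma_K\left( (U^\top O_M)^\top \right) \sigma_K(Q^{\star2})$$

$$= \pi^\star_{\min}\, \sigma_K^2(O_M)\, \sigma_K(Q^{\star2})\,,$$

which concludes the proof. □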
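First step: Estimation of the emission laws using a spectral method

Appendix E shows that:

$$\mathbb{P}\left[ \|\hat L_M - L_M\|_2 \ge C_\star(Q^\star, \delta')\, \eta_1(\Phi_M)/\sqrt{p} \right] \le \delta'\,,$$

$$\mathbb{P}\left[ \|\hat N_M - N_M\|_F \ge C_\star(Q^\star, \delta')\, \eta_2(\Phi_M)/\sqrt{p} \right] \le \delta'\,,$$

$$\mathbb{P}\left[ \|\hat M_M - M_M\|_F \ge C_\star(Q^\star, \delta')\, \eta_3(\Phi_M)/\sqrt{p} \right] \le \delta'\,,$$

$$\mathbb{P}\left[ \|\hat P_M - P_M\|_F \ge C_\star(Q^\star, \delta')\, \eta_2(\Phi_M)/\sqrt{p} \right] \le \delta'\,.$$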

Using the preliminary lemmas of Section C and the elementary fact that A/?7 i($m) < y/Mrfe/l ?m) < V3 ($m)> 
deduce that (23) and (24) along with 3 ||Pm — Pm||< &k(Pm) are satisfied when M > Mg* and p > 
N 0 (Q S') where: 


N 0 (Q *,F,$m,S,6') := 


■( 


1 +- 


lls* 


942 k 8 (Q *)K w /eg 

52 ^min^-CQ* 2 ) a K,3* ^ ^ 


1 =) 2 C*(Q*,<) , ) 2 %($m) 2 . 


Using Theorem C.3, with probability greater than 1 — 2 6 — 4 S', there exists a permutation r satisfying for any 

M > Mg*,p > N 0 (Q*, #*, $ M , <5,5') and x G 

II O m (- ,at) - 6 m (-,t(V))|| 2 < Cm(Q*,S*, S)C*(Q*, S')r]3($ m) / \/p ■ 

This proves the first part of Theorem C.2. 


Second step: Preliminary estimation of the stationary density using a spectral method 

For sake of readability, assume that r is the identity permutation. Observe that: 


n^q*, r, &m, s , <5') > n 0 (q*, r, $m, <y, «'). 

Recall 7 T := (U T 6 M ) _ 1 U T L M and n* = (U t O m )” 1 U t L m . 

Lemma C.9. WzY/z probability greater than 1 — 2<5 — 4<5', if p > Ni(Q*, 3*, 5, i> / ) f/zezz. 


I|7t-7r*ll2< 






V3c K ,$* “ ^5 

2 \/Ni(Q*,5'*, $m,<5, 5') ( 

+ “7f-7=- /tvt /rv ~ f cn ( maxll/^Ha +C*(Q*, A ) 

v3 cr K ,d* s/P~ vNi(Q*,S^' *,$ M ,S, S') \x£X 


Vp 


Proof. Set A = U t Om, A = U t Om and B = U t (Om — Om). Then, 

||-B||< ||Om — Om||< ||Om — O m ||f< Vk max|| 0 M (-, x) — O m (. , ar)||2, 

x€lX 

which gives ||73||< v / A^Cm(Q*,S'*, 6 )C*(Q*, S’)p 3 (^m)/ y/p. Similarly, by claim (iii) ofLemmaF.3: 

||A _1 ||||f3||< cr^ 1 (A)||J5||< 2 v / ^max a ,g^||0 M (. ,ar) - 0 M (. ,x)|| 2 , 

V3a K (0 M ) 


so that 


II A-'BWK ^ CM(Q„r,^(Q„^)^ ■ 

V 3cr K,$* y/P 

Observe that the condition onp and M ensures that ||A _ 1 f3||< 1. Apply Theorem D.2 to get that: 

||(U t Om) _1 - (U t 6 m )- 1 ||< - y/Ni (Q 

V3a K ,g* y/p- VNi(Q *,$*,$m,8,5') 


(25) 
Furthermore, using (25):


||7T-7r*|| 2 = ||(U t 6m) _1 U t L m - (U t Om)“ U t Lm|| 2 

= || (U t Om) _1 U t Lm - (U t O m )" 1 U t L m + (U t Om) _1 U t L m - (U t Om)“ 1 U t L m || 2 
< ||(u t Om) _1 - (U t 6m)“ 1 ||||£m|| 2 +P _1 ||||£m - Lm|| 2 
2 


< 




,3* 


L m — L m || 2 + 


y / N’ 1 (Q*i SN <£ S') ( 

\/V — \/Ni(Q*, S’*, , <5, S') 


^mI| 2 +I|Lm — l M |i 2 


2 ) 


Denote ff = J2 X iex 7r ( x i)/fe 1 (j/i) the density of T). Observe that: 


M 


\ V 2 / M \ t/ 2 

|Lm|| 2 = ( 5Z E [^»( y i)] 2 ) = ( X)</Yi><A*> 2 ) < Il/yj| 2 < max||/*|| 2 . 


which concludes the proof. 

This results allows to state that for all p > 4Ni(Q*, 3*, 4 *m, 5, 5'), 

||tt* -Pr7r|| 2 < ^Af(Q* I 3*,<5)C*(Q*,(5 , )t?3($M)/v / P- 


□ 


(26) 


Third step: Estimation of the transition matrix using a spectral method 

Denote Q := (U 1 OAf2)iag[7r]) 1 U T N A /U(0| f U) 1 . Observe Q = IItm(Q) and Q* = IItm(Q*) and 
hence, by non-expansivity of the projection onto convex sets, || Q — Q*||f< ||Q — Q*||f- Moreover, notice 
that: 

N 2 (Q*, r, *M, 5, S') > 4N!(Q*, r, $M, <5, S') > N 0 (Q*, r , $ M , S'). 

Lemma C.10. With probability greater than 1 — 25 — 4 S', if p > N 2 (Q*, S*, <f>Ff, 5, S') then 

'IQ- Q*ll< ! ll , /(yi,y : ) J l2 - 7r*h+^£ M (Q„r,S)C,(Q^S')^ M) 


^ (T A',5* 7r,r min 


Vp 


where 


£m(Q*,F,5) := 


16 


V^c 


I<,3* 


VKC M {^,r,6)\\fty u Y 3) h+^ m 


Proof. Observe that (20) shows that ||7r — 7r*|| 2 < 7r* nin /2. Then, for any x € X\ 


7r • 

~ \ ' min n 

Ttx > > U . 


(27) 


Set V = (U t Om) _ 1 U T and V = (U T 6 M )" 1 U T . Note Q = S)iafl[7r]- 1 VN M V T and: 

Q = 2)iQg[7r*]- 1 VN M V T . 


Set E = V — V and F = Njj — Njf, Using (25) yields: 


l|£|l< 


2 \/Ni(Q*, 3*, $m, 5, S') < 8 VK 

\/3 y/p- \/Ni(Q*, 3*, $m, 5, S') 3aj < r 


Cm(Q+, S*, <5)C*(Q*, S') 


?73 ($m) 


By claim (iii) of Lemma F.3, ||V||< ct a - 1 (U t Om) < 2/(V3<t k ,s*)- Furthermore, <p a , c {y\, 2 / 3 ) := <p a {yi)<Pc{y 3 ) 
is an orthonormal family of L 2 (3^ 2 , C D ® 2 ) and 


M 1/2 M 1/2 
INm||f= ( 5] E[ Vo (y 1 ) Vc (y 3 )] 2 ) =(E (/(Vi,y 3 )’^ a ’ c )L 2 (^ 2 ,£°® 2 )) — ll/(Vi,y 3 )H 2 * 


a,c= 1 


a,c=l 
Then,


||vn m v t - VNmV t || = ||VNmV t - (V + E)(Nm + F)(V + E) t ||, 

= ||VN m ^ t + VFV T + VFE t + £N Aif V T + EN M E r + EFV T + EFE T \\, 
<2||£;||||V||||Nm||+2|| j E;||||V||||F||+||E|| 2 ||Nm||+||V|| 2 ||F||+||£;|| 2 ||F||, 


yields 


VN„V t - VNmV 1 ||< 


■T| 


, 32VKCm(Q+, ■&*, ^)C*(Q*, F)\\f* Yl ,Y 3 ) II 2 


3\/3 c 


K,d* 


C+(Q±,5') 

11 /(^, 13)112 VpM 


2VKC m (Q*,$\5)C+(Q*,S') %($ m ) 


V3ctk,$* 


Vp 

1 


4C M (Q„r,S)\\f^ Y3) hVKVM 

2VkC m (Q„ S*, <?)C»(Q*. ^') 2 r$ ($m) 

V3a K ,$* ll/(V 1 ,y 3 )l | 2 P 




M 


Vp 


Asp> N 2 (Q*,S : *, $m, <5, S') > 4Ni(Q*, #*, 3>m, 5, S') = ^^C M (Q*,r^) 2 C*(Q*,<5 , ) 2 »?3($M) 2 , 

° K,3* 


|VN m V t - VN m V t ||< £m(Q*,T,S)C*(Q*,6‘ 


Vp 


(28) 


Observe that: 

||Q* - Q|| = llOiag ^*]" 1 - Dia 0 [7r]- 1 )VN M V T + Diag[^- 1 (VN M V T - VN M V T )|| 

< UDiag ^*]” 1 - 2 )iag[ 7 f]- 1 ||||V|| 2 ||N M || + || 2 )iag[ 7 f]- 1 ||||VNMV T - VN M V T | 

z 4 ll/(Vi,y 3 )ll 2 , *_i ^ ^ lnk ^V3 ($m) 

< — n —2 - max(7r -7T X ) + max 7r x £m(Q*, 3, 5)C*(Q*, 6 )- 

dal- i£i 


< 


8 ll/(Wa | 

Hrr 2 7T * 2 1 

ou K,3*' 1 min 


Vp 


^min 


using (27) and (28). □ 

Combining (26) and Lemma C.10 proves the second point of Theorem C.2. 


Last step: Final estimation of the stationary distribution 

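By [H1'], we know that the transition matrix $Q^\star$ is irreducible and aperiodic. Perron-Frobenius theorem shows
that $Q^\star$ has a unique stationary distribution $\pi^\star$. More precisely,

- $\mathbb{R}\pi^\star = \ker(\mathrm{Id}_K - (Q^\star)^\top)$ so that $(\mathbb{R}\pi^\star)^\perp = \mathrm{range}(\mathrm{Id}_K - Q^\star)$,

- and $\langle \pi^\star, \mathbf{1}_K \rangle = 1$,

where $\mathbf{1}_K = (1, \ldots, 1) \in \mathbb{R}^K$. We deduce $\mathbf{1}_K \notin \mathrm{range}(\mathrm{Id}_K - Q^\star)$ and

$$\mathrm{Rank} \begin{pmatrix} \mathrm{Id}_K - (Q^\star)^\top \\ \mathbf{1}_K^\top \end{pmatrix} = K\,. \tag{29}$$

Set

$$\hat A = \begin{pmatrix} \mathrm{Id}_K - \hat Q^\top \\ \mathbf{1}_K^\top \end{pmatrix} \quad \text{and} \quad A^\star = \begin{pmatrix} \mathrm{Id}_K - (Q^\star)^\top \\ \mathbf{1}_K^\top \end{pmatrix}\,.$$

We first derive an upper bound on $\|\hat A^+ - (A^\star)^+\|$ where $A^+$ denotes the Moore-Penrose pseudo-inverse of $A$.
Note that

$$\hat A^+ - (A^\star)^+ = (A^\star)^+ (A^\star - \hat A)\, \hat A^+ - (A^\star)^+ (\mathrm{Id}_{K+1} - \hat A \hat A^+)\,. \tag{30}$$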
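The algebra behind (29) and (30) can be sanity-checked numerically. The NumPy sketch below is an illustration under assumed inputs: the two $3 \times 3$ transition matrices are made up, and `stacked` is a hypothetical helper name. It recovers each stationary distribution as the pseudo-inverse of the stacked matrix applied to the last canonical basis vector of $\mathbb{R}^{K+1}$, and compares both sides of (30).

```python
import numpy as np

def stacked(Q):
    """Stack Id_K - Q^T on top of the row vector of ones, giving a (K+1) x K matrix."""
    K = Q.shape[0]
    return np.vstack([np.eye(K) - Q.T, np.ones((1, K))])

# Made-up transition matrices: Q_star plays the role of the truth, Q_hat of an estimate.
Q_star = np.array([[0.70, 0.20, 0.10], [0.30, 0.50, 0.20], [0.20, 0.30, 0.50]])
Q_hat = np.array([[0.65, 0.25, 0.10], [0.30, 0.45, 0.25], [0.25, 0.30, 0.45]])

A_star, A_hat = stacked(Q_star), stacked(Q_hat)
A_star_pinv, A_hat_pinv = np.linalg.pinv(A_star), np.linalg.pinv(A_hat)

# The stationary distribution solves A pi = (0, ..., 0, 1)^T; A has full column rank K.
e_last = np.zeros(Q_star.shape[0] + 1)
e_last[-1] = 1.0
pi_star, pi_hat = A_star_pinv @ e_last, A_hat_pinv @ e_last

# Both sides of identity (30).
lhs = A_hat_pinv - A_star_pinv
rhs = (A_star_pinv @ (A_star - A_hat) @ A_hat_pinv
       - A_star_pinv @ (np.eye(A_hat.shape[0]) - A_hat @ A_hat_pinv))
```

Since `pi_hat - pi_star` equals `lhs @ e_last` and `e_last` has unit norm, the Euclidean distance between the two stationary distributions is at most the operator norm of `lhs`.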
The last term can be written as

$$(A^\star)^+ (\mathrm{Id}_{K+1} - \hat A \hat A^+) = (A^\star)^+ \left( A^\star (A^\star)^+ \right) (\mathrm{Id}_{K+1} - \hat A \hat A^+) = (A^\star)^+\, P_{\mathrm{range}(A^\star)}\, P_{\mathrm{range}(\hat A)^\perp}\,,$$

where $P_{\mathrm{range}(A^\star)} = A^\star (A^\star)^+$ denotes the orthogonal projection onto $\mathrm{range}(A^\star)$ and $P_{\mathrm{range}(\hat A)^\perp} = \mathrm{Id}_{K+1} - \hat A \hat A^+$
denotes the orthogonal projection onto the orthogonal of $\mathrm{range}(\hat A)$. Define

«(Q *) := <tk(A*) . (31) 


Lemma C.ll. If ||Q — Q*||< s(Q+)/2 then Rank(A) = Rank(A*) = K and 


l|-frange(A*)-f > range(A)- L II — 


2||Q-Q,|| 

«(Q*) 


Proof. The first point follows from Weyl’s inequality, see Theorem D.l. By [28], 


II -frange(A*) - 1 •^range(A) II II-^range(A)- 1 --frange(A* ) II • 

Moreover, since projections P are orthogonal (P ran ge(A)-L-Prange(A*)) T = ^ran g e(A*)-Pran ge (A)-L. Using nota¬ 
tion of [28], one may notice that ||sin#(range(A), range(A*))|| = ||-F > ran g e(A*) J --Uran g e(A) ||- By Wedin’s theo¬ 
rem [28], if ctk{A) > s(Q+)/2 then ||sin 6 *(range(A), range(A*))||< 2 ^Ja*^ • We conclude using Weyl’s 
inequality, see Theorem D.l. □ 


Triangular inequality in (30) gives 

P+-(A*)+||< ||(A*)+||||Q-Q 

IQ - Q* 


< 


ctk(A *) 


A+ - (A*)+|| + 


o K (A*), 
3 


<?k(A*) 


)• 


using that ||(A*)+|| = 1/ok(A*). Deduce that if ||Q—Q*||< ok(A*)/2 then || A + — (A*)+||< 6 ||Q — Q*|| /o\ 
From Weyl’s inequality, if ||Q — Q*||< ok{A*)/2 then ok (A) > ok(A*)/2. Id/r — Q T has rank K — 1 and 
the eigenspace ker(Id/< — Q T ) has dimension 1. Thus, Q is an irreducible and aperiodic transition matrix, and 
7 r is the unique solution to 


fld K - Q 1 

V *K 



(A*). 


Now || 7 r — 7r*|| 2 < ||A + — (A*)+|| and the last part of Theorem C.2 is proved. 


D Matrix perturbation 


We gather in this section some useful results in matrix perturbation theory. Proofs of the following theorems may
be found in [26] for instance.

Theorem D.1 (Weyl's inequality). Let $A, B$ be $(p \times q)$ matrices with $p \ge q$ then, for all $i = 1, \ldots, q$,

$$|\sigma_i(A + B) - \sigma_i(A)| \le \sigma_1(B)\,.$$
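A quick numerical illustration of Weyl's inequality for singular values, on randomly generated matrices; the sizes, the seed and the helper name `weyl_check` are arbitrary choices for this sketch.

```python
import numpy as np

def weyl_check(A, B):
    """Return (max_i |sigma_i(A + B) - sigma_i(A)|, sigma_1(B))."""
    s_A = np.linalg.svd(A, compute_uv=False)        # singular values, decreasing order
    s_AB = np.linalg.svd(A + B, compute_uv=False)
    return np.max(np.abs(s_AB - s_A)), np.linalg.norm(B, 2)

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
B = 0.1 * rng.standard_normal((6, 4))
deviation, bound = weyl_check(A, B)
```

The first returned value is the largest shift of a singular value under the perturbation `B`, which the theorem bounds by the spectral norm of `B`.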


Theorem D.2. Let $A, B$ be $(p \times p)$ matrices. If $A$ is invertible and $\|A^{-1}B\| < 1$ then $\tilde A := A + B$ is invertible
and

$$\|\tilde A^{-1} - A^{-1}\| \le \frac{\|B\|\, \|A^{-1}\|^2}{1 - \|A^{-1}B\|}\,.$$
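The perturbation bound for the inverse can be illustrated in the same way. The matrices below are made up and the helper name `inverse_perturbation` is hypothetical; the sketch uses the spectral norm throughout and refuses inputs that violate the assumption $\|A^{-1}B\| < 1$.

```python
import numpy as np

def inverse_perturbation(A, B):
    """Return (||(A + B)^{-1} - A^{-1}||, ||B|| ||A^{-1}||^2 / (1 - ||A^{-1} B||))."""
    A_inv = np.linalg.inv(A)
    contraction = np.linalg.norm(A_inv @ B, 2)
    if contraction >= 1.0:
        raise ValueError("the bound requires ||A^{-1} B|| < 1")
    lhs = np.linalg.norm(np.linalg.inv(A + B) - A_inv, 2)
    rhs = np.linalg.norm(B, 2) * np.linalg.norm(A_inv, 2) ** 2 / (1.0 - contraction)
    return lhs, rhs

A = np.array([[2.0, 0.5], [0.1, 1.5]])
B = np.array([[0.10, -0.05], [0.02, 0.08]])
error, bound = inverse_perturbation(A, B)
```

The bound follows from $(A+B)^{-1} - A^{-1} = -(A+B)^{-1} B A^{-1}$ together with $\|(A+B)^{-1}\| \le \|A^{-1}\| / (1 - \|A^{-1}B\|)$.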


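Theorem D.3 (Bauer-Fike). Let $A, B$ be $(p \times p)$ matrices and $\tilde A := A + B$. Assume that $A$ is diagonalizable,
i.e. $X^{-1} A X = \Lambda$, where $\Lambda = \mathrm{Diag}[(\lambda_1, \ldots, \lambda_p)]$. Then,

$$\mathrm{sv}_A(\tilde A) \le \kappa(X)\, \|B\|\,, \tag{32}$$

where $\mathrm{sv}_A(\tilde A) := \max_j \min_i |\tilde\lambda_j - \lambda_i|$ and $\tilde\lambda_j$ denotes the eigenvalues of $\tilde A$.

Remark D.1. Moreover, if the disks $D_i := \{\xi : |\xi - \lambda_i| \le \kappa(X)\, \|B\|\}$ are isolated from the others, then
(32) holds with the matching distance $\mathrm{md}(A, \tilde A) \le \kappa(X)\, \|B\|$ where $\mathrm{md}(A, \tilde A) := \min_{\tau \in \mathfrak{S}_p} \max_i |\tilde\lambda_{\tau(i)} - \lambda_i|$.

Eventually, if $A, \tilde A$ are real valued matrices then $\tilde A$ has $p$ distinct real eigenvalues.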
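E Concentration inequalities

Consider consecutive observations of the same hidden Markov chain $Z_s := (Y_s, Y_{s+1}, Y_{s+2})$ for $1 \le s \le p$.

Lemma E.1. For any positive $u$, any $M$ and any $p$:

$$\mathbb{P}\left[ \|\hat L_M - L_M\|_2 \ge \frac{\sqrt{2}\, \eta_1(\Phi_M)}{\sqrt{p\, G_{ps}}} \left( 1 + 2u \sqrt{1 + \log(8/\pi^\star_{\min})} \right) \right] \le \exp(-u^2)\,,$$

$$\mathbb{P}\left[ \|\hat M_M - M_M\|_F \ge \frac{\sqrt{2}\, \eta_3(\Phi_M)}{\sqrt{p\, G_{ps}}} \left( 1 + 2u \sqrt{1 + \log(8/\pi^\star_{\min})} \right) \right] \le \exp(-u^2)\,,$$

$$\mathbb{P}\left[ \|\hat N_M - N_M\|_F \ge \frac{\sqrt{2}\, \eta_2(\Phi_M)}{\sqrt{p\, G_{ps}}} \left( 1 + 2u \sqrt{1 + \log(8/\pi^\star_{\min})} \right) \right] \le \exp(-u^2)\,,$$

$$\mathbb{P}\left[ \|\hat P_M - P_M\|_F \ge \frac{\sqrt{2}\, \eta_2(\Phi_M)}{\sqrt{p\, G_{ps}}} \left( 1 + 2u \sqrt{1 + \log(8/\pi^\star_{\min})} \right) \right] \le \exp(-u^2)\,.$$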


Proof. Set C Lm (Z 3 ,..., Z p ) \\Lm{Zi, • • •, Z p ) — L M || 2 , ■ • • > Z p ) := \\M. m {Zi, ■ • •, Z p ) — 

Mm||f, Cn m (Zh ■ ■ ■ i Z p ) := ||Nm(-Zi, • • •, Z p ) — N m ||f and ( Pm (Zi, ..., Z p ) := ||Pm(-^i, • • ■, Z p ) — 
P m ||f where, for instance Lm(-Zi,..., Z p ) denotes the dependence of Lm in Z\,..., Z p . We begin with 
Cm M 5 other cases are similar. Form the difference with respect to the coordinate i: 

C% •= sup ICmm^Ii ■ • • ,Z i - 1 ,Z i ,Z i+1 , . . . ,Z p ) - (m m (Zi, • • • , Zi- lt z', z i+1 ,..., z p )\ . 
zjey 3 ,z'ey 3 


By the triangular inequality. 


Ci < sup 
z 3 ey 3 ,z^ey 3 


■M!m(^1; • ■ • > Zi—h Zii ^i+li ■ * ■ i Zp) Mm (^ 1 , • ■ • > Zi— 1, ZZi+ 1, ..., z p ) 


so that 


1 

Cj < — sup 

P z i ey 3 ,z’ i Gy 3 


(paiyi^Pbiy^^Pciy^) 


‘Pa(y’i ) )<Pb(y , 2 > )<Pc(y' 


(i)s 


/W 



Eventually, we get that c, < 773 ($m)/p- By McDiarmid’s inequality [24], for all u > 0, 

,2 


P(||Mm — M m ||f> e 


Mm — M 


M F 


+ u) < exp — 


pu 


8T mix ?7|($M)/ 


We need the following lemma that can be deduced from [24], 
Lemma E.2. For any a,b,c £ {1 ,M}, 


1/2 


E 


y, -[Va{Ys)Pb(Y s+ \)<p c {Y s +2) - ^[Pa{Yi)ip b (Y2)pc{Y3)}] 


< -^e [ip a (Yi)M Y 2)pc(Y 3 ) - e [^ a (yi)^(y 2 )^ c (y 3 )]] 2 . 


P G 


ps 


Proof. Notice that (X \. Yf), (X 2 , Y 2 ), ... is homogenous, irreducible, aperiodic and stationary Markov chain 
on X x y, whose stationary distribution is tt(x, d y) := it x p x (dy). Observe that its transition kernel Q satisfies, 
for all x, x' £ X and all y, y' £ y. 


Q(x,y,x',dy') = Q^(x,x')p x i(dy'). 

The transition kernel Q can be viewed as an operator Q on the Hilbert space L 2 (- 7 r) defined, for all / £ L 2 (ff), 
by: 

(Qf)(x,y) :=%,„.,.)(/)= y Q*(x, x') f f{x',y')p, x '{dy'). 

x'ex 
Note that Qf(x,y) does not depend on y. Set E := {f(x,y) £ L 2 ( 7 r) : / does not depend on y}. The

L 2 ( 7 r)-self-adjoint operator defined, for all f £ L 2 ( 7 r), by 

(n E f)(x,y) := f f(x,y')/j, x (dy'), 

Jy 

is the orthogonal projection onto E. Since 11 /--;QI 1 = Q, the set of nonzero eigenvalues of Q is exactly the set 

of nonzero eigenvalues of the K dimensional linear operator II /.;QII/r. Eventually, note that the matrix of Q in 
the basis ((a:, y) i->- 1 x i— x ) x 'ex is Q*- Then, the pseudo spectral gap of Q is equal to <G ps (the pseudo spectral 
gap of Q*). 

Furthermore, note the same analysis can be made for (X \. X 2 . X :i . Z \), (X'l- X :i . X \. Z>),... and its pseudo 
spectral gap is the pseudo spectral gap of the Markov chain (X\, X 2 , X$), (X 2 , X 3 , X 4 ),... which is G ps . 
Indeed, the set of nonzero eigenvalues of the Markov chain (Xi, X 2 , X 3 ), (X 2 , X 3 , X 4 ),... is equal to the set 
of nonzero eigenvalues of the Markov chain Xi, X 2 , ■ . .. 

Eventually, set g(X s , X s+ 1 ,X s+2 , Z s ) := ( l/p)v a (Y s )<p b (Y s+ 1 )<p c {Y s+ 2 ) and apply Theorem 3.7 in [24] 
to conclude the proof. □ 

Then, 

r _ -| r _ it/ 2 

E ||M m — M m ||f <E ||Mm — Mm||f > 

(l p 

^ E ;E <Pa{Ys)<Pb{Ys+l)<Pc{y 8 + 2 ) - E [<Pa('5 / l)^b(y 2 )^c(E3)] 

a,b,c 5=1 

/ P 1 

< E e E {ty?a(Xs)ty?6(Xs+l) ( /?c(Xs+2) E [( Pa0^l) ( fb0 / 2) ( f’c0^3 )]} 

a,b,c \s=l ^ 

< JE E E Vpo,{Y 1 )ip b (Y 2 )(p c {Y3) - EMYi)MY 2 )MY 3 )} 2 

V^PB [aE 

/ 2 \ 1 / 2 r 1 1/2 

- V^r) [ E E(^( yi )^( y 2 )^( y 3)-^( y, 1 )^( y ' 2 )^( F, 3)) 2 J ’ 

' ' a,b,c 

< , 

\ P^ps J 

using Jensen’s inequality. Lemma E.2 and then 2E((7 —E(7 ) 2 < E (U — U 1 ) 2 where U is any real valued random 
variable with finite second moment and U' an independent copy of U. The proof is similar for L m, N m and 

P M- □ 

F Proof of Theorem C.3 

Preliminaries lemmas 

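Lemma F.1. For all $b \in \{1, \ldots, M\}$,

$$M_M(\cdot, b, \cdot) = O_M\, \mathrm{Diag}[\pi^\star]\, Q^\star\, \mathrm{Diag}[O_M(b, \cdot)]\, Q^\star\, O_M^\top\,.$$

Similarly, $P_M = O_M\, \mathrm{Diag}[\pi^\star]\, Q^{\star2}\, O_M^\top$.

Proof. Let $(a, c) \in \{1, \ldots, M\}^2$ and observe that:

$$\left( O_M\, \mathrm{Diag}[\pi^\star]\, Q^\star\, \mathrm{Diag}[O_M(b, \cdot)]\, Q^\star\, O_M^\top \right)(a, c)$$

$$= \sum_{(x_1,x_2,x_3)\in\mathcal{X}^3} O_M(a, x_1)\, \pi^\star(x_1)\, Q^\star(x_1, x_2)\, O_M(b, x_2)\, Q^\star(x_2, x_3)\, O_M(c, x_3)\,,$$

$$= \sum_{(x_1,x_2,x_3)\in\mathcal{X}^3} \mathbb{E}\left[ \varphi_a(Y_1) | X_1 = x_1 \right] \mathbb{P}(X_1 = x_1)\, \mathbb{P}(X_2 = x_2 | X_1 = x_1)$$

$$\qquad \times\ \mathbb{E}\left[ \varphi_b(Y_2) | X_2 = x_2 \right] \mathbb{P}(X_3 = x_3 | X_2 = x_2)\, \mathbb{E}\left[ \varphi_c(Y_3) | X_3 = x_3 \right]\,,$$

$$= \mathbb{E}\left[ \varphi_a(Y_1)\varphi_b(Y_2)\varphi_c(Y_3) \right]\,.$$

Similarly,

$$\left( O_M\, \mathrm{Diag}[\pi^\star]\, Q^{\star2}\, O_M^\top \right)(a, c) = \sum_{(x_1,x_2,x_3)\in\mathcal{X}^3} O_M(a, x_1)\, \pi^\star(x_1)\, Q^\star(x_1, x_2)\, Q^\star(x_2, x_3)\, O_M(c, x_3)\,,$$

$$= \sum_{(x_1,x_2,x_3)\in\mathcal{X}^3} \mathbb{E}\left[ \varphi_a(Y_1) | X_1 = x_1 \right] \mathbb{P}(X_1 = x_1)\, \mathbb{P}(X_2 = x_2 | X_1 = x_1)$$

$$\qquad \times\ \mathbb{P}(X_3 = x_3 | X_2 = x_2)\, \mathbb{E}\left[ \varphi_c(Y_3) | X_3 = x_3 \right]\,,$$

$$= \mathbb{E}\left[ \varphi_a(Y_1)\varphi_c(Y_3) \right]\,,$$

which concludes the proof. □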
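The factorisations of Lemma F.1 can be verified on a toy model by brute-force enumeration of the hidden path $(x_1, x_2, x_3)$. The sketch below makes a simplifying assumption that is not the paper's setting: observations take $M = 3$ discrete values and $\varphi_a$ is the indicator of the value $a$, so that `O[a, x]` is simply $\mathbb{P}(Y = a \mid X = x)$. All numbers and function names are illustrative.

```python
import itertools
import numpy as np

def moments_bruteforce(pi, Q, O):
    """E[phi_a(Y1) phi_b(Y2) phi_c(Y3)] and E[phi_a(Y1) phi_c(Y3)] by summing over hidden paths."""
    M, K = O.shape
    M_tensor, P = np.zeros((M, M, M)), np.zeros((M, M))
    for x1, x2, x3 in itertools.product(range(K), repeat=3):
        weight = pi[x1] * Q[x1, x2] * Q[x2, x3]
        M_tensor += weight * np.einsum("a,b,c->abc", O[:, x1], O[:, x2], O[:, x3])
        P += weight * np.outer(O[:, x1], O[:, x3])
    return M_tensor, P

def moments_factorised(pi, Q, O):
    """Same quantities through the matrix products of Lemma F.1."""
    D = np.diag(pi)
    P = O @ D @ Q @ Q @ O.T
    slices = [O @ D @ Q @ np.diag(O[b]) @ Q @ O.T for b in range(O.shape[0])]
    return np.stack(slices, axis=1), P  # entry [a, b, c] is slices[b][a, c]

# Toy model with K = 2 hidden states and M = 3 observed symbols (made-up numbers).
pi = np.array([0.6, 0.4])                            # stationary for Q below
Q = np.array([[0.8, 0.2], [0.3, 0.7]])
O = np.array([[0.5, 0.1], [0.3, 0.3], [0.2, 0.6]])   # columns sum to 1

M_brute, P_brute = moments_bruteforce(pi, Q, O)
M_fact, P_fact = moments_factorised(pi, Q, O)
```

With indicator functions, `M_brute` is the joint law of three consecutive observations, so its entries sum to one; the factorised version should coincide with it entrywise.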

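Lemma F.2. Let $U$ be any $(M \times K)$ matrix such that $P_M U$ has rank $K$. Then,

- for all $b \in \{1, \ldots, M\}$,

$$B(b) := (P_M U)^\dagger\, M_M(\cdot, b, \cdot)\, U = R\, \mathrm{Diag}[O_M(b, \cdot)]\, R^{-1}\,,$$

where $R^{-1} := Q^\star O_M^\top U$ and $(P_M U)^\dagger := (U^\top P_M^\top P_M U)^{-1} U^\top P_M^\top$ denotes the Moore-Penrose pseudoinverse
of the matrix $P_M U$;

- $U^\top P_M U$ is invertible and, for all $b \in \{1, \ldots, M\}$,

$$B(b) = (U^\top P_M U)^{-1}\, U^\top M_M(\cdot, b, \cdot)\, U = R\, \mathrm{Diag}[O_M(b, \cdot)]\, R^{-1}\,.$$

Proof. Observe that $M_M(\cdot, b, \cdot)\, U = O_M\, \mathrm{Diag}[\pi^\star]\, Q^\star\, \mathrm{Diag}[O_M(b, \cdot)]\, R^{-1} = P_M U R\, \mathrm{Diag}[O_M(b, \cdot)]\, R^{-1}$
as claimed. □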
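Lemma F.2 is the identity that lets the spectral method read the rows of $O_M$ off the eigenvalues of $B(b)$. The following NumPy sketch checks it on a toy model, assuming exact (population) moments, discrete observations with indicator functions as in the simplified setting above, and $U$ taken as the top-$K$ left singular vectors of $P_M$; these choices and all names are assumptions of the example.

```python
import numpy as np

# Toy model (made-up numbers): K = 2 hidden states, M = 3 observed symbols.
pi = np.array([0.6, 0.4])
Q = np.array([[0.8, 0.2], [0.3, 0.7]])
O = np.array([[0.5, 0.1], [0.3, 0.3], [0.2, 0.6]])
K = Q.shape[0]
D = np.diag(pi)

# Population moments, written with the factorisations of Lemma F.1.
P = O @ D @ Q @ Q @ O.T

def M_slice(b):
    """M_M(., b, .) = O Diag[pi] Q Diag[O(b, .)] Q O^T."""
    return O @ D @ Q @ np.diag(O[b]) @ Q @ O.T

# U spans the range of P, which has rank K here.
U = np.linalg.svd(P)[0][:, :K]

def observable_operator(b):
    """B(b) = (U^T P U)^{-1} U^T M_M(., b, .) U."""
    return np.linalg.solve(U.T @ P @ U, U.T @ M_slice(b) @ U)

R = np.linalg.inv(Q @ O.T @ U)  # R^{-1} := Q O^T U
b = 0
B_b = observable_operator(b)
eigenvalues = np.sort(np.real(np.linalg.eigvals(B_b)))
```

The eigenvalues of `B_b` should be the entries of the row `O[b]`, and the columns of `R` the corresponding eigenvectors, which does not depend on `b`; this common eigenbasis is what makes the matrices $B(b)$ simultaneously diagonalizable.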

Lemma F.3. Assume that 2||Pm — Pm||< &k (Pm), then: 


(i) 


(H) 


M ’ 


|Pm — Pm| 


o"a'(Pm) — ||Pm — Pm | 


< 1 , 


ctk(Pm) > 


&k(Pm) — ||Pm - P 


Ml 


o"a'(Pm) 


<jk(Pm) > 


cat(Pm) 


(Hi) a K (U T U) > (1 — 4 m) 1 / 2 > 

(iv) ^(U t P m U) > (1 - 4 m W( P m ), 

(v) foralla£M. K and for all v € Range(P m)> ||Ua — u|| 2 < ||a — U T n|| 2 +£p M ||f || 2 , 

(vi) i/3||P m - Pm||< cta-(Pm) then: 

ax(U T P M U) > ; 
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(vii) 


IICC^PmU )- 1 - (^PmU)- 1 !! < 


Pm-P 


Mil 


< 3.2 


<jk(Pm)(1 - 4 m )(( 1 _ e Pj„K( P m) - 
"Pm — Pm II 


Pm - P 


Ml 


'K 


(Pm) 


Proof. See Lemma C. 1 in [3] for the first five claims. The sixth claim follows from the fourth point and Theorem 
D. 1. The seventh point follows from the fourth claim and Theorem D.2. □ 


Control of the observable operator 

Claim (iv) in Lemma F.3 and Lemma F.2 ensure that, for all b £ {1,..., M }, 

B (b) := (U t P m U)- 1 U t M m (. ,b,.)t = R®iag[0 M (b, . )]R _1 , 

where R 1 may be defined as 

R- 1 := 2)iag[(||(Q*O^U) _1 (., 1)|| 2 , ..., ||(Q*C>t U)-^., K)\\ 2 )}Q+O t m U. 

Set A := 0 t U t O m and for all x £ X, C(x) := Z™i(Ue)(b,x)B(b) = RSictg[A(:r,. )]R _1 . Note that R 
has unit Euclidean norm columns: 


R = (Q*Oj f U) _1 ®ia0[(||(Q*Oj f U) _1 (., 1)|| 2 ,..., ||(Q*0][, U)"^., A^IU)]” 1 , 
corresponding to unit Euclidean norm eigenvectors of C (k). 

Lemma F.4. Assume that 3||Pm — Pm||< ctx(Pm). then, for all b £ {1,..., M}, 


l|B(6) 


B( 6 )||< 3.2 


I|Ma/(. , b,. )|| 

<j k (Pm) l ||M m (.,6,.)|| 


||Pm-Pm|| - 
<Lr:(Pm) J ’ 


and for all x € X, 


||C(cr) 


C(ar)||< 3.2. 


|M 


M11 oo,2 


Mm — M 


M ||oo ,2 


^x(Pm) L ||Mm||oo,2 


||Pm-Pm|| - 
cta(Pm) - 


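Proof. Observe that:

\begin{align*}
\|\hat{B}(b) - B(b)\| &\le \|(\hat{U}^\top \hat{P}_M \hat{U})^{-1}\hat{U}^\top \hat{M}_M(\cdot,b,\cdot)\hat{U} - (\hat{U}^\top \hat{P}_M \hat{U})^{-1}\hat{U}^\top M_M(\cdot,b,\cdot)\hat{U}\| \\
&\quad + \|(\hat{U}^\top \hat{P}_M \hat{U})^{-1}\hat{U}^\top M_M(\cdot,b,\cdot)\hat{U} - (\hat{U}^\top P_M \hat{U})^{-1}\hat{U}^\top M_M(\cdot,b,\cdot)\hat{U}\| , \\
&\le \|\hat{U}^\top(\hat{M}_M(\cdot,b,\cdot) - M_M(\cdot,b,\cdot))\hat{U}\|\,\|(\hat{U}^\top \hat{P}_M \hat{U})^{-1}\| \\
&\quad + \|(\hat{U}^\top \hat{P}_M \hat{U})^{-1} - (\hat{U}^\top P_M \hat{U})^{-1}\|\,\|\hat{U}^\top M_M(\cdot,b,\cdot)\hat{U}\| , \\
&\le \|\hat{M}_M(\cdot,b,\cdot) - M_M(\cdot,b,\cdot)\|\,\sigma_K^{-1}(\hat{U}^\top \hat{P}_M \hat{U}) \\
&\quad + \|M_M(\cdot,b,\cdot)\|\,\|(\hat{U}^\top \hat{P}_M \hat{U})^{-1} - (\hat{U}^\top P_M \hat{U})^{-1}\| .
\end{align*}

By claims (vi) and (vii) of Lemma F.3, $3\sigma_K(\hat{U}^\top \hat{P}_M \hat{U}) \ge \sigma_K(P_M)$ and $\|(\hat{U}^\top \hat{P}_M \hat{U})^{-1} - (\hat{U}^\top P_M \hat{U})^{-1}\| \le 3.2\,\|\hat{P}_M - P_M\|/\sigma_K^2(P_M)$. Replacing $M_M(\cdot,b,\cdot)$ by $\sum_{b=1}^{M}(\hat{U}\Theta)(b,x)M_M(\cdot,b,\cdot)$ yields the same result for $\|\hat{C}(x) - C(x)\|$. □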
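
The splitting used in this proof is the generic identity $\hat{A}^{-1}\hat{N} - A^{-1}N = \hat{A}^{-1}(\hat{N} - N) + (\hat{A}^{-1} - A^{-1})N$ followed by submultiplicativity of the operator norm. The following sketch checks the resulting inequality numerically on hypothetical $2 \times 2$ matrices standing for $\hat{U}^\top P_M \hat{U}$, $\hat{U}^\top \hat{P}_M \hat{U}$, $\hat{U}^\top M_M(\cdot,b,\cdot)\hat{U}$ and its empirical counterpart; the numerical values are arbitrary and only serve as an illustration.

```python
import math


def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def sub(a, b):
    """Entrywise difference of two 2x2 matrices."""
    return [[a[i][j] - b[i][j] for j in range(2)] for i in range(2)]


def inverse(a):
    """Inverse of a 2x2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]


def spectral_norm(a):
    """Largest singular value of a 2x2 matrix, from the eigenvalues of a^T a."""
    trace = sum(a[i][j] ** 2 for i in range(2) for j in range(2))
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    disc = max(trace * trace - 4.0 * det * det, 0.0)
    return math.sqrt((trace + math.sqrt(disc)) / 2.0)


# Hypothetical population and empirical matrices (arbitrary values).
A = [[2.0, 0.3], [0.1, 1.5]]
A_hat = [[2.05, 0.28], [0.12, 1.46]]
N = [[0.9, 0.2], [0.4, 0.7]]
N_hat = [[0.93, 0.18], [0.41, 0.72]]

lhs = spectral_norm(sub(matmul(inverse(A_hat), N_hat), matmul(inverse(A), N)))
rhs = (spectral_norm(sub(N_hat, N)) * spectral_norm(inverse(A_hat))
       + spectral_norm(sub(inverse(A_hat), inverse(A))) * spectral_norm(N))
print(lhs <= rhs)
```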

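Lemma F.5. Assume that $2\|\hat{P}_M - P_M\| \le \sigma_K(P_M)$. Then:

(i)

\[ \kappa(R) := \|R\|\,\|R^{-1}\| \le \kappa^2(Q^\star O_M^\top \hat{U}) \le \frac{\kappa^2(Q^\star O_M^\top)}{1 - \varepsilon_{P_M}^2} . \]

(ii)

\[ \mathrm{sv}_{C(1)}(\hat{C}(1)) \le \kappa(R)\,\|\hat{C}(1) - C(1)\| \le \frac{\kappa^2(Q^\star O_M^\top)}{1 - \varepsilon_{P_M}^2}\,\|\hat{C}(1) - C(1)\| , \]

where $\mathrm{sv}_{C(1)}(\hat{C}(1)) := \max_{x_1 \in \mathcal{X}} \min_{x_2 \in \mathcal{X}} |\hat{\Lambda}(1,x_1) - \Lambda(1,x_2)|$.

(iii) If, in addition,

\[ \frac{\kappa^2(Q^\star O_M^\top)}{1 - \varepsilon_{P_M}^2}\,\|\hat{C}(1) - C(1)\| < \min_{x \neq x' \in \mathcal{X}} |\Lambda(1,x) - \Lambda(1,x')|/2 , \]

then $\hat{C}(1)$ has $K$ distinct real eigenvalues and:

\[ \mathrm{md}(C(1), \hat{C}(1)) \le \frac{\kappa^2(Q^\star O_M^\top)}{1 - \varepsilon_{P_M}^2}\,\|\hat{C}(1) - C(1)\| , \]

where $\mathrm{md}(C(1), \hat{C}(1)) := \min_{\tau \in S_K} \max_{x \in \mathcal{X}} |\hat{\Lambda}(1,\tau(x)) - \Lambda(1,x)|$.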


Proof. Observe that U is an orthonormal basis of the range of $O_M$. The first point follows from claim (iii) of Lemma F.3. The second point is derived from Theorem D.3 and the first point. The remark following Theorem D.3 proves the last point. □


Control of the spectra 

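Lemma F.6. For any $0 < \delta < 1$,

\[ \mathbb{P}\left[\forall x,\ \forall x_1 \neq x_2,\ |\Lambda(x,x_1) - \Lambda(x,x_2)| \ge \frac{2\delta(1 - \varepsilon_{P_M}^2)^{1/2}}{\sqrt{e}\,K^{5/2}(K-1)}\,\gamma(O_M)\right] \ge 1 - \delta . \]

Furthermore:

\[ \mathbb{P}\left[\|\Lambda\|_\infty \ge \frac{1 + \sqrt{2\log(K^2/\delta)}}{\sqrt{K}}\,\|O_M\|_{2,\infty}\right] \le \delta . \]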


Proof. Observe that:

\begin{align*}
\Lambda(x,x_1) - \Lambda(x,x_2) &= \langle \Theta(\cdot,x),\ (\hat{U}^\top O_M)(\cdot,x_1) - (\hat{U}^\top O_M)(\cdot,x_2)\rangle \\
&= \langle \Theta(\cdot,x),\ \hat{U}^\top (O_M(\cdot,x_1) - O_M(\cdot,x_2))\rangle .
\end{align*}

Furthermore, from (iii) in Lemma F.3, we get that:

\[ \|\hat{U}^\top (O_M(\cdot,x_1) - O_M(\cdot,x_2))\|_2 \ge (1 - \varepsilon_{P_M}^2)^{1/2}\,\|O_M(\cdot,x_1) - O_M(\cdot,x_2)\|_2 \ge (1 - \varepsilon_{P_M}^2)^{1/2}\,\gamma(O_M) . \]

Similarly, note that:

\[ \|\Lambda\|_\infty = \max_{x,x'} |\langle \Theta(\cdot,x),\ \hat{U}^\top O_M(\cdot,x')\rangle| , \]

and $\|\hat{U}^\top O_M(\cdot,x')\|_2 \le \|O_M(\cdot,x')\|_2 \le \|O_M\|_{2,\infty}$. For the sake of readability, we borrow the result of Lemma F.2 and the argument of Lemma C.6 in [3] to conclude. □
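
To get a feeling for the orders of magnitude involved, the two thresholds of Lemma F.6 can be evaluated for given values of $K$ and $\delta$. The sketch below is only a formula evaluation under the reading of the lemma given above; the function names and the numerical inputs ($K = 3$, $\delta = 0.05$, $\varepsilon_{P_M}^2 = 1/4$, $\gamma(O_M) = 0.4$, $\|O_M\|_{2,\infty} = 1$) are hypothetical.

```python
import math


def separation_threshold(delta, eps_sq, gamma, K):
    """Lower bound on |Lambda(x, x1) - Lambda(x, x2)| holding with probability >= 1 - delta."""
    if not 0.0 < delta < 1.0 or K < 2:
        raise ValueError("need 0 < delta < 1 and K >= 2")
    return 2.0 * delta * math.sqrt(1.0 - eps_sq) * gamma / (math.sqrt(math.e) * K ** 2.5 * (K - 1))


def sup_norm_threshold(delta, K, o_norm):
    """Level exceeded by ||Lambda||_inf with probability at most delta."""
    if not 0.0 < delta < 1.0 or K < 1:
        raise ValueError("need 0 < delta < 1 and K >= 1")
    return (1.0 + math.sqrt(2.0 * math.log(K ** 2 / delta))) / math.sqrt(K) * o_norm


# Hypothetical inputs, for illustration only.
K, delta = 3, 0.05
sep = separation_threshold(delta, eps_sq=0.25, gamma=0.4, K=K)
sup = sup_norm_threshold(delta, K, o_norm=1.0)
print(sep, sup)
```

The separation threshold is linear in $\delta$ and decays like $K^{-7/2}$, whereas the sup-norm threshold only grows like $\sqrt{\log(1/\delta)}$; this asymmetry is what drives the polynomial factors in $K$ and $1/\delta$ in the subsequent perturbation bound.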


Perturbation of simultaneously diagonalizable matrices 

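Lemma F.7. If $3\|\hat{P}_M - P_M\| \le \sigma_K(P_M)$ and:

\[ \frac{8.2\,K^{5/2}(K-1)\,\kappa^2(Q^\star O_M^\top)}{\delta\,\gamma(O_M)\,\sigma_K(P_M)}\left[\|\hat{M}_M - M_M\|_{\infty,2} + \frac{\|M_M\|_{\infty,2}\,\|\hat{P}_M - P_M\|}{\sigma_K(P_M)}\right] < 1 , \tag{33} \]

\[ \frac{43.4\,K^{4}(K-1)\,\kappa^4(Q^\star O_M^\top)}{\delta\,\gamma(O_M)\,\sigma_K(P_M)}\left[\|\hat{M}_M - M_M\|_{\infty,2} + \frac{\|M_M\|_{\infty,2}\,\|\hat{P}_M - P_M\|}{\sigma_K(P_M)}\right] < 1 , \tag{34} \]

and for all $x$, $x_1 \neq x_2$,

\[ |\Lambda(x,x_1) - \Lambda(x,x_2)| \ge \frac{2\delta(1 - \varepsilon_{P_M}^2)^{1/2}}{\sqrt{e}\,K^{5/2}(K-1)}\,\gamma(O_M) , \]

and:

\[ \|\Lambda\|_\infty \le \frac{1 + \sqrt{2\log(K^2/\delta)}}{\sqrt{K}}\,\|O_M\|_{2,\infty} , \]

then there exists $\tau \in S_K$ such that for all $x \in \mathcal{X}$:

\[ \|\Lambda(\cdot,x) - \hat{\Lambda}(\cdot,\tau(x))\|_\infty \le \left[13\,\frac{\kappa^2(Q^\star O_M^\top)}{\sigma_K(P_M)} + 116\,K^{7/2}(K-1)\big\{1 + (2\log(K^2/\delta))^{1/2}\big\}\,\frac{\kappa^6(Q^\star O_M^\top)\,\|O_M\|_{2,\infty}}{\delta\,\gamma(O_M)\,\sigma_K(P_M)}\right] \times \left[\|\hat{M}_M - M_M\|_{\infty,2} + \frac{\|M_M\|_{\infty,2}\,\|\hat{P}_M - P_M\|}{\sigma_K(P_M)}\right] . \]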


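Proof. Note that $\varepsilon_{P_M} \le 1/2$. Invoke the last part of Claim 4 of Lemma C.4 in [3] with

\[ \gamma_A \leftarrow \frac{2\delta(1 - \varepsilon_{P_M}^2)^{1/2}}{\sqrt{e}\,K^{5/2}(K-1)}\,\gamma(O_M), \qquad \kappa(R) \leftarrow \frac{\kappa^2(Q^\star O_M^\top)}{1 - \varepsilon_{P_M}^2}, \qquad \|R^{-1}\|_2^2 \leftarrow \frac{\kappa^4(Q^\star O_M^\top)}{(1 - \varepsilon_{P_M}^2)^2}, \]

\[ \varepsilon_A \leftarrow 3.2\left[\frac{\|\hat{M}_M - M_M\|_{\infty,2}}{\sigma_K(P_M)} + \frac{\|M_M\|_{\infty,2}\,\|\hat{P}_M - P_M\|}{\sigma_K^2(P_M)}\right] \qquad \text{and} \qquad \lambda_{\max} \leftarrow \frac{1 + \sqrt{2\log(K^2/\delta)}}{\sqrt{K}}\,\|O_M\|_{2,\infty} . \]

Observe that (33) agrees with $\varepsilon_3 < 1/2$ and (34) agrees with $\varepsilon_4 < 1/2$. □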


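Since $\Theta^\top$ is an isometry, observe that:

\[ \|\hat{U}^\top O_M(\cdot,x) - \Theta\hat{\Lambda}(\cdot,\tau(x))\|_2 = \|\Lambda(\cdot,x) - \hat{\Lambda}(\cdot,\tau(x))\|_2 \le \sqrt{K}\,\|\Lambda(\cdot,x) - \hat{\Lambda}(\cdot,\tau(x))\|_\infty . \]

Claim (v) in Lemma F.3 (with $a = \Theta\hat{\Lambda}(\cdot,\tau(x))$ and $v = O_M(\cdot,x)$) gives

\begin{align*}
\|O_M(\cdot,x) - \hat{O}_M(\cdot,\tau(x))\|_2 &\le \|\hat{U}^\top O_M(\cdot,x) - \Theta\hat{\Lambda}(\cdot,\tau(x))\|_2 + \frac{3\|\hat{P}_M - P_M\|}{2\sigma_K(P_M)}\,\|O_M(\cdot,x)\|_2 \\
&\le \sqrt{K}\,\|\Lambda(\cdot,x) - \hat{\Lambda}(\cdot,\tau(x))\|_\infty + \frac{3\|\hat{P}_M - P_M\|}{2\sigma_K(P_M)}\,\|O_M(\cdot,x)\|_2 .
\end{align*}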


Theorem C.3 follows from Lemma F.7. 
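
The last step combines a sup-norm control on the eigenvalues with the norm comparison $\|v\|_2 \le \sqrt{K}\,\|v\|_\infty$ on $\mathbb{R}^K$. The sketch below illustrates both ingredients on hypothetical numbers; the helper names and all numerical inputs are assumptions made for the example and are not taken from the paper.

```python
import math


def euclidean_and_scaled_sup(v):
    """Return (||v||_2, sqrt(K) * ||v||_inf) for a vector v of length K."""
    norm2 = math.sqrt(sum(c * c for c in v))
    scaled_sup = math.sqrt(len(v)) * max(abs(c) for c in v)
    return norm2, scaled_sup


def column_error_bound(K, lambda_err_sup, p_err, sigma_K, column_norm):
    """Bound on the Euclidean error of one recovered column.

    K              : number of hidden states
    lambda_err_sup : sup-norm error on the corresponding column of Lambda
    p_err          : operator-norm error ||P_hat - P||
    sigma_K        : K-th singular value of P
    column_norm    : Euclidean norm of the true column
    """
    if 3.0 * p_err > sigma_K:
        raise ValueError("the bound assumes 3 * ||P_hat - P|| <= sigma_K(P)")
    return math.sqrt(K) * lambda_err_sup + 1.5 * p_err / sigma_K * column_norm


# Hypothetical error vector: the Euclidean norm never exceeds sqrt(K) times the sup norm.
norm2, scaled_sup = euclidean_and_scaled_sup([0.02, -0.01, 0.015])

# Hypothetical inputs for the final bound.
bound = column_error_bound(K=3, lambda_err_sup=0.02, p_err=0.01, sigma_K=0.5, column_norm=1.0)
print(norm2 <= scaled_sup, bound)
```

The second term of the bound does not depend on the eigenvalue error, so it sets a floor on the achievable accuracy that is governed by $\|\hat{P}_M - P_M\|/\sigma_K(P_M)$ alone.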